Abstract

When delegating computation to a service provider, as in the cloud computing paradigm, we seek 
some reassurance that the output is correct and complete. Yet recomputing the output as a check is 
inefficient and expensive, and it may not even be feasible to store all the data locally. We are therefore 
interested in what can be validated by a streaming (sublinear space) user, who cannot store the full input, 
or perform the full computation herself. Our aim in this work is to advance a recent line of work on 
"proof systems" in which the service provider proves the correctness of its output to a user. The goal is 
to minimize the time and space costs of both parties in generating and checking the proof. Only very 
recently have there been attempts to implement such proof systems, and thus far these have been quite 
limited in functionality.

Here, our approach is two-fold. First, we describe a carefully chosen instantiation of one of the 
most efficient general-purpose constructions for arbitrary computations (streaming or otherwise), due to 
Goldwasser, Kalai, and Rothblum [19]. This requires several new insights and enhancements to move 
the methodology from a theoretical result to a practical possibility. Our main contribution is in achieving 
a prover that runs in time $O(S(n) \log S(n))$, where $S(n)$ is the size of an arithmetic circuit computing
the function of interest; this compares favorably to the $\mathrm{poly}(S(n))$ runtime for the prover promised
in [19]. Our experimental results demonstrate that a practical general-purpose protocol for verifiable
computation may be significantly closer to reality than previously realized.

Second, we describe a set of techniques that achieve genuine scalability for protocols fine-tuned 
for specific important problems in streaming and database processing. Focusing in particular on non- 
interactive protocols for problems ranging from matrix-vector multiplication to bipartite perfect match- 
ing, we build on prior work [8, 15] to achieve a prover that runs in nearly linear-time, while obtaining 
optimal tradeoffs between communication cost and the user's working memory. Existing techniques re- 
quired (substantially) superlinear time for the prover. Finally, we develop improved interactive protocols 
for specific problems based on a linearization technique originally due to Shen [34]. We argue that even 
if general-purpose methods improve, fine-tuned protocols will remain valuable in real-world settings for 
key problems, and hence special attention to specific problems is warranted. 

1 Introduction 

One obvious impediment to larger-scale adoption of cloud computing solutions is the matter of trust. In 
this paper, we are specifically concerned with trust regarding the integrity of outsourced computation. If we 
store a large data set with a service provider, and ask them to perform a computation on that data set, how 
can the provider convince us the computation was performed correctly? Even assuming a non-malicious
service provider, errors due to faulty algorithm implementation, disk failures, or memory read errors are not 
uncommon, especially when operating on massive data. 

A natural approach, which has received significant attention particularly within the theory community, 
is to require the service provider to provide a proof along with the answer to the query. Adopting the 
terminology of proof systems [2], we treat the user as a verifier V, who wants to solve a problem with the 
help of the service provider who acts as a prover P. After P returns the answer, the two parties conduct a
conversation following an established protocol that satisfies the following property: an honest prover will 
always convince the verifier to accept its results, while any dishonest or mistaken prover will almost certainly 
be rejected by the verifier. This model has led to many interesting theoretical techniques in the extensive 
literature on interactive proofs. However, the bulk of the foundational work in this area assumed that the 
verifier can afford to spend polynomial time and resources in verifying a prover's claim to have solved a 
hard problem (e.g. an NP-complete problem). In our setting, this is too much: rather, the prover should 
be efficient, ideally with effort close to linear in the input size, and the verifier should be lightweight, with 
effort that is sublinear in the size of the data. 

To this end, we additionally focus on results where the verifier operates in a streaming model, taking a 
single pass over the input and using a small amount of space. This naturally fits the cloud setting, as the 
verifier can perform this streaming pass while uploading the data to the cloud. For example, consider a 
retailer who forwards each transaction incrementally as it occurs. We model the data as too large for the 
user to even store in memory, hence the need to use the cloud to store the data as it is collected. Later, the 
user may ask the cloud to perform some computation on the data. The cloud then acts as a prover, sending 
both an answer and a proof of integrity to the user, keeping in mind the user's space restrictions. 

We believe that such mechanisms are vital to expand the commercial viability of cloud computing ser- 
vices by allowing a trust-but-verify relationship between the user and the service provider. Indeed, even 
if every computation is not explicitly checked, the mere ability to check the computation could stimulate
users to adopt cloud computing solutions. Hence, in this paper, we focus on the issue of the practicality of
streaming verification protocols. 

There are many relevant costs for such protocols. In the streaming setting, the main concern is the space 
used by the verifier and the amount of communication between P and V. Other important costs include the
space and time cost to the prover, the runtime of the verifier, and the total number of messages exchanged 
between the two parties. If any one of these costs is too high, the protocol may not be useful in real-world 
outsourcing scenarios. 

In this work, we take a two-pronged approach. Ideally, we would like to have a general-purpose methodology
that allows us to construct an efficient protocol for an arbitrary computation. We therefore examine the
costs of one of the most efficient general-purpose protocols known in the literature on interactive proofs, due 
to Goldwasser, Kalai, and Rothblum [19]. We describe an efficient instantiation of this protocol, in which 
the prover is significantly faster than in prior work, and present several modifications which we needed 
to make our implementation scalable. We believe our success in implementing this protocol demonstrates 
that a fully practical method for reliable delegation of arbitrary computation is much closer to reality than 
previously realized. 

Although encouraging, our general-purpose implementation is not yet practical for everyday use. Hence, 
our second line of attack is to improve upon the general construction via specialized protocols for a large 
subset of important problems. Here, we describe two techniques in particular that yield significantly more 
scalable protocols than previously known. First, we show how to use certain Fast Fourier Transforms to 
obtain highly scalable non-interactive protocols that are suitable for practice today; these protocols require 
just one message from P to V, and no communication in the reverse direction. Second, we describe how
to use a 'linearization' method applied to polynomials to obtain improved interactive protocols for certain 
problems. All of our work is backed by empirical evaluation based on our implementations. 

Depending on the technique and the problem in question, we see empirical results that vary in speed by 
five orders of magnitude in terms of the cost to the prover. Hence, we argue that even if general-purpose 
methods improve, fine-tuned protocols for key problems will remain valuable in real-world settings, espe- 
cially as these protocols can be used as primitives in more general constructions. Therefore, special attention 
to specific problems is warranted. The other costs of providing proofs are acceptably low. For many prob- 
lems our methods require at most a few megabytes of space and communication even when the input consists 
of terabytes of data, and some use much less; moreover, the time costs of P and V scale linearly or almost
linearly with the size of the input. Most of our protocols require a polylogarithmic number of messages
between P and V, but a few are non-interactive, and send just one message.

To summarize, we view the contributions of this paper as: 

• A carefully engineered general-purpose implementation of the circuit checking construction of [19], along 
with some extensions to this protocol. We believe our results show that a practical delegation protocol for 
arbitrary computations is significantly closer to reality than previously realized. 

• The development of powerful and broadly applicable methods for obtaining practical specialized pro- 
tocols for large classes of problems. We demonstrate empirically that these techniques easily scale to 
streams with billions of updates. 

1.1 Previous Work 

The concept of an interactive proof was introduced in a burst of activity around twenty years ago [3, 20, 
25, 33, 34]. This culminated in a celebrated result of Shamir [33], which showed that the set of problems 
with efficient interactive proofs is exactly the set of problems that can be computed in polynomial space. 
However, these results were primarily seen as theoretical statements about computational complexity, and 
did not lead to implementations. More recently, motivated by real-world applications involving the dele- 
gation of computation, there has been considerable interest in proving that the cloud is operating correctly. 
For example, one line of work considers methods for proving that data is being stored without errors by an 
external source such as the cloud, e.g., [22]. 

In our setting, we model the verifier as capable of accessing the data only via a single, streaming pass. 
Under this constraint, there has been work in the database community on ensuring that simple functions 
based on grouping and counting are performed correctly; see [37] and the references therein. Other similar 
examples include work on verifying queries on a data stream with sliding windows using Merkle trees [24] 
and verifying continuous queries over streaming data [29]. 

Most relevant to us is work which verifies more complex and more general functions of the input. The 
notion of a streaming verifier, who must read first the input and then the proof under space constraints, 
was formalized by Chakrabarti et al. [8] and extended by the present authors in [15]. These works allowed 
the prover to send only a single message to the verifier, with no communication in the reverse direction. 
With similar motivations, Goldwasser et al. [19] give a powerful protocol that achieves a polynomial time 
prover and highly-efficient verifier for a large class of problems, although they do not explicitly present 
their protocols in a streaming setting. Subsequently, it has been noted that the information required by the 
verifier can be collected with a single initial streaming pass, and so for a large class of uniform computations, 
the verifier operates with only polylogarithmic space and time. Finally, Cormode et al. [17] introduce the 
notion of streaming interactive proofs, extending the model of [8] by allowing multiple rounds of interaction 
between prover and verifier. They present exponentially cheaper protocols than those possible in the
single-message model of [8, 15], for a variety of problems of central importance in database and stream processing.

A different line of work has used fully homomorphic encryption to ensure integrity, privacy, and reusabil- 
ity in delegated computation [18, 12, 11]. The work of Chung, Kalai, Liu, and Raz [11] is particularly 
related, as they focus on delegation of streaming computation. Their results are stronger than ours, in that 
they achieve reusable general-purpose protocols (even if P learns whether V accepts or rejects each proof),
but their soundness guarantees rely on computational assumptions, and the substantial overhead due to the 
use of fully homomorphic encryption means that these protocols remain far from practical at this time. 

Only very recently have there been sustained efforts to use techniques derived from the complexity and 
cryptography worlds to actually verify computations. Bhattacharyya implements certain PCP constructions 
and indicates they may be close to practical [4]. In parallel to this work, Setty et al. [31, 32] survey 
results on probabilistically checkable proofs (PCPs), and implement a construction originally due to Ishai 
et al. [21]. While their work represents a clear advance in the implementation of PCPs, our approach has 
several advantages over [31, 32]. For example, our protocols save space and time for the verifier even when 
outsourcing a single computation, while [31, 32] saves time for the verifier only when batching together 
several dozen computations at once and amortizing the verifier's cost over the batch. Moreover, our protocols 
are unconditionally secure even against computationally unbounded adversaries, while the construction of 
Ishai et al. relies on cryptographic assumptions to obtain security guarantees. Another practically-motivated 
approach is due to Canetti et al. [7]. Their implementation delegates the computation to two independent 
provers, and "plays them off" against each other: if they disagree on the output, the protocol identifies
where their executions diverge, and favors the one which follows the program correctly at the next step. 
This approach requires at least one of the provers to be honest for any security guarantee to hold. 

1.2 Preliminaries 

Definitions. We first formally define a valid protocol. Here we closely follow previous work, such as [17]
and [8]. 

Definition 1.1 Consider a prover P and verifier V who both observe a stream A and wish to compute a
function f(A). After the stream is observed, P and V exchange a sequence of messages. Denote the output
of V on input A, given prover P and V's random bits R, by out(V, A, R, P). V can output $\perp$ if V is not
convinced that P's claim is valid.

P is a valid prover with respect to V if for all streams A

$\Pr[\mathrm{out}(V, A, R, P) = f(A)] = 1.$

We call V a valid verifier for f if there is at least one valid prover P with respect to V, and for all provers
P' and all streams A

$\Pr[\mathrm{out}(V, A, R, P') \notin \{f(A), \perp\}] \le 1/3.$

Essentially, this definition states that a prover who follows the protocol correctly will always convince 
V, while if P makes any mistakes or false claims, then this will be detected with at least constant probability.
In fact, for our protocols, this 'false positive' probability can easily be made arbitrarily small. 

As our first concern in a streaming setting is the space requirements of the verifier as well as the com- 
munication cost for the protocol, we make the following definition. 
Definition 1.2 We say f possesses an r-message (h, v) protocol, if there exists a valid verifier V for f such
that:

1. V has access to only $O(v)$ words of working memory.

2. There is a valid prover P for V such that P and V exchange at most r messages in total, and the sum
of the lengths of all messages is $O(h)$ words.

We refer to one-message protocols as non-interactive. We say an r-message protocol has $\lceil r/2 \rceil$ rounds.

A key step in many proof systems is the evaluation of the low-degree extension of some data at multiple 
points. That is, the data is interpreted as implicitly defining a polynomial function which agrees with the 
data over the range 1 . . . n, and which can also be evaluated at points outside this range as a check. The 
existence of streaming verifiers relies on the fact that such low-degree extensions can be evaluated at any 
given location incrementally as the data is presented [17]. 

Input Representation. All protocols presented in this paper can handle inputs specified in a very general
data stream form. Each element of the stream is a tuple $(i, \delta)$, where $i \in [n]$ and $\delta$ is an integer (which may
be negative, thereby modeling deletions). The data stream implicitly defines a frequency vector a, where $a_i$
is the sum of all $\delta$ values associated with i in the stream, and the goal is to compute a function of a. Notice
the function of a to be computed may interpret a as an object other than a vector, such as a matrix or a string. 
For example, in the MVMULT problem described below, a defines a matrix and a vector to be multiplied, 
and in some of the graph problems considered as extensions in Section 2, a defines the adjacency matrix of 
a graph. 

In Sections 2 and 3, the manner in which we describe protocols may appear to assume that the data 
stream has been pre-aggregated into the frequency vector a (for example, in Section 3, we apply the protocol
of Goldwasser et al. [19] to arithmetic circuits whose i'th input wire has value $a_i$). It is therefore important
to emphasize that in fact all of the protocols in this paper can be executed in the input model of the previous 
paragraph, where V only sees the raw (unaggregated) stream and not the aggregated frequency vector a, 
and there is no explicit conversion between the raw stream and the aggregated vector a. This follows from 
observations in [8, 17], which we describe here for completeness. 

The critical observation is that in all of our protocols, the only information V must extract from the data 
stream is the evaluation of a low-degree extension of a at a random point r, which we denote by $\mathrm{LDE}_a(r)$,
and this value can be computed incrementally by V using $O(1)$ words of memory as the raw stream is
presented to V. Crucially this is possible because, for fixed r, the function $a \mapsto \mathrm{LDE}_a(r)$ is linear, and thus
it is straightforward for V to compute the contribution of each update $(i, \delta)$ to $\mathrm{LDE}_a(r)$.

More precisely, we can write $\mathrm{LDE}_a(r) = \sum_{i \in [n]} a_i \chi_i(r)$, where $\chi_i$ is a (Lagrange) polynomial that
depends only on i. Thus, V can compute $\mathrm{LDE}_a(r)$ incrementally from the raw stream by initializing
$\mathrm{LDE}_a(r) \leftarrow 0$, and processing each update $(i, \delta)$ via:

$\mathrm{LDE}_a(r) \leftarrow \mathrm{LDE}_a(r) + \delta \chi_i(r).$

V only needs to store $\mathrm{LDE}_a(r)$ and r, which requires $O(1)$ words of memory. Moreover, for any i, $\chi_i(r)$
can be computed in $O(\log n)$ field operations, and thus V can compute $\mathrm{LDE}_a(r)$ with one pass over the raw
stream, using $O(1)$ words of space and $O(\log n)$ field operations per update.
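The incremental update above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the modulus `P`, the helper names `chi`, `stream_lde` and `aggregated_lde`, and the choice of a univariate Lagrange basis over the points 0..n-1. The basis is evaluated naively in O(n) field operations per update, so the sketch demonstrates only the linearity argument, not the $O(\log n)$ per-update cost stated in the text.

```python
# Sketch: streaming evaluation of LDE_a(r) over a prime field F_p.
# Assumption: univariate Lagrange basis on the points 0..n-1, evaluated naively.

P = 2**61 - 1  # a prime modulus, chosen here only for illustration


def chi(i, r, n, p=P):
    """Lagrange basis polynomial chi_i for the points 0..n-1, evaluated at r."""
    num, den = 1, 1
    for x in range(n):
        if x != i:
            num = num * ((r - x) % p) % p
            den = den * ((i - x) % p) % p
    # den is nonzero mod p because 0 < |i - x| < n < p
    return num * pow(den, -1, p) % p


def stream_lde(stream, r, n, p=P):
    """One pass over raw updates (i, delta), storing a single field element."""
    acc = 0
    for i, delta in stream:
        acc = (acc + delta * chi(i, r, n, p)) % p
    return acc


def aggregated_lde(a, r, p=P):
    """Reference value computed from the aggregated frequency vector a."""
    n = len(a)
    return sum(a_i * chi(i, r, n, p) for i, a_i in enumerate(a)) % p
```

Because the map from a to its extension at r is linear, `stream_lde` applied to the raw updates (including deletions) agrees with `aggregated_lde` applied to the final frequency vector, without the vector ever being materialized by the verifier.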

Problems. To focus our discussion and experimental study, we describe four key problems that capture 
different aspects of computation: data grouping and aggregation, linear algebra, and pattern matching. We 
will study how to build valid protocols for each of these problems. Throughout, let $[n] = \{0, \ldots, n - 1\}$
denote the universe from which data elements are drawn. 
F2: Given a stream of m elements from [n], compute $\sum_{i \in [n]} a_i^2$, where $a_i$ is the number of occurrences of
i in the stream. This is also known as the second frequency moment, a special case of the kth frequency
moment $F_k = \sum_{i \in [n]} a_i^k$.

F0: Given a stream of m elements from [n], compute the number of distinct elements, i.e. the number of i
with $a_i > 0$, where again $a_i$ is the number of occurrences of i in the stream.

MVMULT: Given a stream defining an $n \times n$ integer matrix A, and vectors $x, b \in \mathbb{Z}^n$, determine whether
$Ax = b$. More generally, we are interested in the case where P provides a vector b which is claimed to
be $Ax$. This is easily handled by our protocols, since V can treat the provided b as part of the input, even
though it may arrive after the rest of the input.

PMwW: Given a stream representing text $T = (t_0, \ldots, t_{n-1}) \in [n]^n$ and pattern $P = (p_0, \ldots, p_{q-1}) \in
[n]^q$, the pattern P is said to occur at location i in T if, for every position j in P, either $p_j = t_{i+j}$ or at least
one of $p_j$ and $t_{i+j}$ is the wildcard symbol *. The pattern-matching with wildcards problem is to determine
the number of locations at which P occurs in T. 
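To make the four problem statements concrete, the following Python sketch computes each quantity directly, in linear space and with no verification. These are plain reference computations, not protocols; the function names and the `WILDCARD` encoding are hypothetical choices made for this illustration.

```python
# Sketch: direct (unverified, linear-space) reference computations.
from collections import Counter

WILDCARD = "*"  # hypothetical encoding of the wildcard symbol


def f2(stream):
    """Second frequency moment: sum of squared occurrence counts."""
    return sum(c * c for c in Counter(stream).values())


def f0(stream):
    """Number of distinct elements in an insert-only stream."""
    return len(set(stream))


def mvmult_holds(A, x, b):
    """Check whether A x = b for a matrix given as a list of rows."""
    return all(
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) == b_i
        for row, b_i in zip(A, b)
    )


def pmww(text, pattern):
    """Count locations i where the pattern occurs in the text, with wildcards."""
    q = len(pattern)

    def occurs_at(i):
        return all(
            p == t or WILDCARD in (p, t)
            for p, t in zip(pattern, text[i:i + q])
        )

    return sum(occurs_at(i) for i in range(len(text) - q + 1))
```

For instance, `pmww("ab*ab", "a*")` counts locations 0, 2 and 3, since a wildcard on either side matches any symbol.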

For simplicity, we will assume the stream length m and the universe size n are on the same order of
magnitude, i.e. $m = \Theta(n)$.

All four problems require linear space in the streaming model to solve exactly (although there are space- 
efficient approximation algorithms for the first three [28]). 

Non-interactive versus Multi-round Protocols. Protocols for reliable delegation fall into two classes: 
non-interactive, in which a single message is sent from prover to verifier and no communication occurs 
in the reverse direction; and multi-round, where the two parties have a sustained conversation, possibly 
spanning hundreds of rounds or more. There are merits and drawbacks to each. 

— Non-interactive Advantages: The non-interactive model has the desirable property that the prover can 
compute the proof and send it to the verifier (in an email, or posted on a website) for V to retrieve and 
validate at her leisure. In contrast, the multi-round case requires P and V to interact online. Due to round-trip
delays, the time cost of multi-round protocols can become high; moreover, P may have to do substantial
computation after each message. This can involve maintaining state between messages, and performing 
many passes over the data. A less obvious advantage is that non-interactive protocols can be repeated 
for different instances (e.g. searching for different patterns in PMwW) without requiring V to use fresh 
randomness. This allows the verifier to amortize much of its time cost over many queries, potentially 
achieving sublinear time cost per query. The reason this is possible is that in the course of a non-interactive 
protocol, P learns nothing about V's private randomness (assuming P does not learn whether V accepts or
rejects the proof) and so we can use a union bound to bound the probability of error over multiple instances.
In contrast, in the multi-round case, V must divulge most of its private random bits to P over the course of
the protocol. 

— Multi-round Advantages: The overall cost in a multi-round protocol can be lower, as most non-interactive 
protocols require V to use substantial space and read a large proof. Indeed, prior work [8, 15] has shown 
that space or communication must be $\Omega(\sqrt{n})$ for most non-interactive protocols [8]. Nonetheless, even for
terabyte streams of data, these costs typically translate to only a few megabytes of space and communication, 
which is tolerable in many applications. Of more concern is that the time cost to the prover in known non- 
interactive protocols is typically much higher than in the interactive case, though this gap is not known to 
be inherent. We make substantial progress in closing this gap in prover runtime in Section 2, but this still 
leaves an order of magnitude difference in practice (Section 5). 
1.3 Outline and Contributions



We consider non-interactive protocols first, and interactive protocols second. To begin, we describe in 
Section 2 how to use Fast Fourier Transform methods to engineer P's runtime in the F2 protocol of [8] 
down from $O(n^{3/2})$ to nearly-linear time. The F2 protocol is a key target, because (as we describe) several
protocols build directly upon it. We show in Section 5 that this results in a speedup of hundreds of thousands 
of updates per second, bringing this protocol, as well as those that build upon it, from theory to practice. 

Turning to interactive protocols, in Section 3 we describe an efficient instantiation of the general-purpose
construction of [19]. Here, we also describe efficient protocols for specific problems of high interest
including F0 and PMwW based on an application of our implementation to carefully chosen circuits. The
latter protocol enables verifiable searching (even with wildcards) in the cloud, and complements work on 
searching in encrypted data within the cloud (e.g. [5]). Our final contribution in this section is to demonstrate 
that the use of more general arithmetic gates to enhance the basic protocol of [19] allows us to significantly 
decrease prover time, communication cost, and message cost of these two protocols in practice. 

In Section 4 we provide alternative interactive protocols for important specific problems based on a 
technique known as linearization; we demonstrate in Section 5 that linearization yields a protocol for F0
in which P runs nearly two orders of magnitude faster than in all other known protocols for this problem.
Finally, we describe our observations on implementing these different methods, including our carefully 
engineered implementation of the powerful general-purpose construction of [19]. 

2 Fast Non-interactive Proofs via Fast Fourier Transforms 

In this section, we describe how to drastically speed up P's computation for a large class of specialized, 
non-interactive protocols. In non-interactive proofs, P often needs to evaluate a low-degree extension at a
large number of locations, which can be the bottleneck. Here, we show how to reduce the cost of this step 
to near linear, via Fast Fourier Transform (FFT) methods. 

For concreteness, we describe the approach in the context of a non-interactive protocol for F2 given in 
[8]. Initial experiments on this protocol identified the prover's runtime as the principal bottleneck in the
protocol [17]. In this implementation, P required $\Theta(n^{3/2})$ time, and consequently the implementation fails
to scale for n > 10^. Here, we show that FFT techniques can dramatically speed up the prover, leading to a 
protocol that easily scales to streams consisting of billions of items. 

We point out that F2 is a problem of significant interest, beyond being a canonical streaming problem. 
Many existing protocols in the non-interactive model are built on top of F2 protocols, including finding 
the inner product and Hamming distance between two vectors [8], the MVMULT problem, solving a large 
class of linear programs, and graph problems such as testing connectivity and identifying bipartite perfect
matchings [9, 15]. These protocols are particularly important because they all achieve provably optimal 
tradeoffs between space and communication costs [8]. Thus, by developing a scalable, practical protocol 
for F2, we also achieve big improvements in protocols for a host of important (and seemingly unrelated) 
problems. 

Non-interactive F2 and MVMult Protocols. We first outline the protocol from [8, Theorem 4] for F2 on 
an n dimensional vector. This construction yields an $(n^{\alpha}, n^{1-\alpha})$ protocol for any $0 < \alpha < 1$, i.e. it allows
a tradeoff between the amount of communication and space used by V; for brevity we describe the protocol
when $\alpha = 1/2$.

Assume for simplicity that n is a perfect square. We treat the n dimensional vector as a $\sqrt{n} \times \sqrt{n}$ array
a. This implies a two-variate polynomial f over a suitably large finite field $\mathbb{F}_p$, such that

$\forall (x, y) \in [\sqrt{n}] \times [\sqrt{n}] : f(x, y) = a_{x,y}.$

To compute F2, we wish to compute

$\sum_{x \in [\sqrt{n}],\, y \in [\sqrt{n}]} a_{x,y}^2 = \sum_{x \in [\sqrt{n}],\, y \in [\sqrt{n}]} f(x, y)^2.$

The low-degree extension f can also be evaluated at locations outside $[\sqrt{n}] \times [\sqrt{n}]$. In the protocol,
the verifier V picks a random position $r \in \mathbb{F}_p$, and evaluates f(r, y) for every $y \in [\sqrt{n}]$ ([8] shows how
V can compute any f(r, y) incrementally in constant space). The proof given by P is in the form of a
degree $2(\sqrt{n} - 1)$ polynomial s(X) which is claimed to be $\sum_{y \in [\sqrt{n}]} f(X, y)^2$. V uses the values of f(r, y)
to check that $s(r) = \sum_{y \in [\sqrt{n}]} f(r, y)^2$, and if so accepts $\sum_{x \in [\sqrt{n}]} s(x)$ as the correct answer. Clearly
V's check will pass if s is as claimed. The proof of validity follows from the Schwartz-Zippel lemma: if
$s(X) \ne \sum_{y \in [\sqrt{n}]} f(X, y)^2$ as claimed by P, then

$\Pr\left[s(r) = \sum_{y \in [\sqrt{n}]} f(r, y)^2\right] \le \frac{2(\sqrt{n} - 1)}{p},$

where p is the size of the finite field $\mathbb{F}_p$. Thus, if P deviates at all from the prescribed protocol, the verifier's
check will fail with high probability.
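The protocol just described can be sketched end to end in Python for small inputs. This is a hedged illustration, not the implementation evaluated in this paper: the modulus `P`, the function names, and the index layout (stream index i maps to array cell (i div k, i mod k) with k = $\sqrt{n}$) are all assumptions, and the prover here evaluates the low-degree extension naively rather than with the FFT techniques of Section 2.1.

```python
# Sketch: non-interactive F2 protocol with alpha = 1/2, naive prover.
# Assumptions: k = sqrt(n); stream index i corresponds to cell (i // k, i % k).

P = 2**61 - 1  # prime field size (illustrative choice)


def lagrange_eval(values, r, p=P):
    """Evaluate at r the unique polynomial of degree < len(values) that takes
    value values[x] at x = 0, 1, ..., len(values) - 1."""
    m = len(values)
    total = 0
    for i, v in enumerate(values):
        num, den = 1, 1
        for x in range(m):
            if x != i:
                num = num * ((r - x) % p) % p
                den = den * ((i - x) % p) % p
        total = (total + v * num % p * pow(den, -1, p)) % p
    return total


def verifier_pass(stream, k, r, p=P):
    """Single streaming pass: maintain f(r, y) for each of the k columns y."""
    fr = [0] * k
    for i, delta in stream:
        x, y = divmod(i, k)
        unit = [0] * k
        unit[x] = 1  # lagrange_eval(unit, r) equals chi_x(r)
        fr[y] = (fr[y] + delta * lagrange_eval(unit, r, p)) % p
    return fr


def honest_proof(a, k, p=P):
    """Prover's message: s(x) = sum_y f(x, y)^2 at the 2k - 1 points 0..2k-2."""
    cols = [[a[x][y] % p for x in range(k)] for y in range(k)]
    return [
        sum(lagrange_eval(col, x, p) ** 2 for col in cols) % p
        for x in range(2 * k - 1)
    ]


def verify(fr, proof, k, r, p=P):
    """Return the claimed F2 if s(r) == sum_y f(r, y)^2, otherwise None."""
    if len(proof) != 2 * k - 1:
        return None
    if lagrange_eval(proof, r, p) != sum(v * v for v in fr) % p:
        return None
    return sum(proof[:k]) % p
```

In use, V would draw r uniformly at random from the field before the stream begins and keep it secret from P; the verifier stores only r and the k values returned by `verifier_pass`, and the proof consists of 2k - 1 field elements, matching the $(\sqrt{n}, \sqrt{n})$ tradeoff.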

A non-interactive protocol for MVMULT uses similar ideas. Each entry in the output is the result of an 
inner product between two vectors: a row of matrix A and vector x. Each of the n entries in the output can 
be checked independently with a variation of the above protocol, where the squared values are replaced by 
products of vector entries; this naive approach yields an $(n^{3/2}, n^{3/2})$ protocol for MVMULT. [15] observes
that, because x is held constant throughout all n inner product computations, V's space requirements can be
reduced by having V keep track of hashed information, rather than full vectors. The messages from P do not
change, however, and computing low-degree extensions of the input data is the chief scalability bottleneck.
This construction yields a 1-message $(n^{1+\alpha}, n^{1-\alpha})$ protocol (as in Definition 1.2) for any $0 < \alpha < 1$, and
this can be shown to be optimal.



2.1 Breaking the bottleneck 

Since s(X) has degree at most $2\sqrt{n} - 1$ it is uniquely specified by its values at any $2\sqrt{n}$ locations. We show
how P can quickly evaluate all values in the set

$S := \{(x, s(x)) : x \in [2\sqrt{n}]\}.$

Since $s(X) = \sum_{y \in [\sqrt{n}]} f(X, y)^2$, given all values in the set

$T := \{(x, y, f(x, y)) : x \in [2\sqrt{n}],\, y \in [\sqrt{n}]\},$

all values in S can be computed in time linear in n. The implementation of [17] calculated each value in T
independently, requiring $\Theta(n^{3/2})$ time overall. We show how FFT techniques allow us to calculate T much
faster.

The task of computing T boils down to multi-point evaluation of the polynomial f. It is known how
to perform fast multi-point evaluation of univariate degree t polynomials in time $O(t \log t)$, and bivariate
polynomials in subquadratic time, if the polynomial is specified by its coefficients [27]. However, there is
substantial overhead in converting f to a coefficient representation. It is more efficient for us to directly
work with and exchange polynomials in an implicit representation, by specifying their values at sufficiently 
many points. 

Representing as a convolution. We are given the values of f at all points located on the $[\sqrt{n}] \times [\sqrt{n}]$
"grid". We leverage this fact to compute T efficiently in nearly linear time by a direct application of the
Fast Fourier Transform. For $(x, y) \in [\sqrt{n}] \times [\sqrt{n}]$, f(x, y) is just $a_{x,y}$, which P can store explicitly while
processing the stream. It remains to calculate (x, y, f(x, y)) for $\sqrt{n} \le x < 2\sqrt{n}$. For fixed $y \in [\sqrt{n}]$, we
may write f(X, y) explicitly as

$f(X, y) = \sum_{i \in [\sqrt{n}]} a_{i,y} \chi_i(X),$

where $\chi_i$ is the Lagrange polynomial¹

$\chi_i(X) = \prod_{x \in [\sqrt{n}] \setminus \{i\}} (X - x)(i - x)^{-1}.$

If $j \notin [\sqrt{n}]$, then we may write

$f(j, y) = \sum_{i \in [\sqrt{n}]} h(j)\, b_y(i)\, g(j - i), \quad (1)$

where $b_y(i) = a_{i,y} \prod_{x \in [\sqrt{n}] \setminus \{i\}} (i - x)^{-1}$, $h(j) = \prod_{x \in [\sqrt{n}]} (j - x)$, and $g(j - i) = (j - i)^{-1}$.

As a result f(j, y) can be computed as a circular convolution of $b_y$ and g, scaled by h(j). That is, for a
fixed y, all values in the set $T_y := \{(x, y, f(x, y)) : x \in [2\sqrt{n}]\}$ can be found by computing the convolution
in Equation 1, then scaling each entry by the appropriate value of h(j).
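As a sanity check on Equation 1, the Python sketch below extends one column of the grid from the points 0..k-1 to the points k..2k-1 (with k = $\sqrt{n}$) using the decomposition into $b_y$, g and h. The function names and the modulus are assumptions for illustration, and the convolution sum is evaluated term by term in O(k) time per output rather than by a fast transform, so this shows only that the identity holds, not the claimed running time.

```python
# Sketch: extending one column via Equation (1), with a naive convolution sum.

P = 2**61 - 1  # prime modulus (illustrative choice)


def inv(v, p=P):
    """Multiplicative inverse in F_p."""
    return pow(v % p, -1, p)


def extend_column(col, p=P):
    """Given col[i] = f(i, y) for i = 0..k-1, return f(j, y) for j = k..2k-1."""
    k = len(col)
    # b_y(i) = a_{i,y} * prod_{x != i} (i - x)^{-1}
    b = []
    for i in range(k):
        den = 1
        for x in range(k):
            if x != i:
                den = den * ((i - x) % p) % p
        b.append(col[i] * inv(den, p) % p)
    out = []
    for j in range(k, 2 * k):
        # h(j) = prod_{x in [k]} (j - x)
        h = 1
        for x in range(k):
            h = h * (j - x) % p
        # sum_i b_y(i) * g(j - i), where g(d) = d^{-1}
        conv = sum(b[i] * inv(j - i, p) for i in range(k)) % p
        out.append(h * conv % p)
    return out
```

For a column sampled from the quadratic $3x^2 + 2x + 5$ at x = 0, 1, 2, namely `[5, 10, 21]`, the sketch returns the values of the same quadratic at x = 3, 4, 5.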

Computing the Convolution. We represent $b_y$ and $g$ by vectors of length $2\sqrt{n}$ over a suitable field, and
take the Discrete Fourier Transform (DFT) of each. The convolution is the inverse transform of the pointwise
product of the two transforms [23, Chapter 5]. There is some freedom to choose the field over which to
perform the transform. We can compute the DFT of $b_y$ and $g$ over the complex field $\mathbb{C}$ using
$O(\sqrt{n} \log n)$ arithmetic operations via standard techniques such as the Cooley-Tukey algorithm [14], then
round the final result to the nearest integer and reduce it modulo $p$. Logarithmically many bits of precision
past the decimal point suffice to obtain a sufficiently accurate result. Since we compute $O(\sqrt{n})$ such
convolutions, we obtain the following result:

Theorem 2.1 The honest prover in the $F_2$ protocol of [8, Theorem 4] requires $O(n \log n)$ arithmetic
operations on numbers of bit-complexity $O(\log n + \log p)$.

† That is, the unique polynomial of degree less than $\sqrt{n}$ such that $\chi_i(i) = 1$, while $\chi_i(j) = 0$
for all $j \in [\sqrt{n}]$ with $j \neq i$. Here, the inverse is the multiplicative inverse within the field.
In practice, however, working over $\mathbb{C}$ can be slow, and requires us to deal with precision issues. Since
the original data resides in some finite field $\mathbb{F}_p$, and can be represented as fixed-precision integers,
it is preferable to also compute the DFT over the same field. Here, we exploit the fact that in designing our
protocol, we can choose to work over any sufficiently large finite field $\mathbb{F}_p$.

There are two issues to address: we need that there exists a DFT for sequences of length $2\sqrt{n}$ (or
thereabouts) in $\mathbb{F}_p$, and further that this DFT has a corresponding (fast) Fourier Transform algorithm.
We can resolve both issues with the Prime Factor Algorithm (PFA) for the DFT in $\mathbb{F}_p$ [6]. The "textbook"
Cooley-Tukey FFT algorithm operates on sequences whose length is a power of two. Instead, the PFA
works on sequences of length $N = N_1 \times N_2 \times \cdots \times N_k$, where the $N_i$'s are pairwise coprime.
The time cost of the transform is $O((\sum_i N_i) N)$. The algorithm is typically applied over the complex numbers,
but also applies over $\mathbb{F}_p$: it works by breaking the large DFT up into a sequence of smaller DFTs, each of
size $N_i$ for some $i$. These base DFTs for sequences of length $N_i$ exist for $\mathbb{F}_p$ whenever there exists
a primitive $N_i$'th root of unity in $\mathbb{F}_p$. This is the case whenever $N_i$ is a divisor of $p - 1$. So we
are in good shape so long as $p - 1$ has many distinct prime factors.
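To make the existence condition concrete, the following sketch builds a DFT of length $N$ over $\mathbb{F}_p$ from a primitive $N$'th root of unity and uses it to compute a cyclic convolution. It is a sketch under stated assumptions: it fixes $p = 2^{61} - 1$, relies on the list of primes dividing $p - 1$, and uses a naive quadratic-time DFT as a stand-in for the PFA decomposition. All function names are hypothetical.

```python
P = 2**61 - 1
# Primes dividing P - 1 = 2^61 - 2 (assumed known; used to test for a generator).
PRIME_FACTORS = [2, 3, 5, 7, 11, 13, 31, 41, 61, 151, 331, 1321]

def find_generator():
    """Smallest g >= 2 that generates the multiplicative group of F_p."""
    g = 2
    while any(pow(g, (P - 1) // q, P) == 1 for q in PRIME_FACTORS):
        g += 1
    return g

def root_of_unity(n):
    """A primitive n'th root of unity in F_p; requires n to divide P - 1."""
    assert (P - 1) % n == 0
    return pow(find_generator(), (P - 1) // n, P)

def dft(vec, w):
    """Naive O(n^2) DFT over F_p with respect to the root of unity w."""
    n = len(vec)
    return [sum(vec[k] * pow(w, j * k, P) for k in range(n)) % P for j in range(n)]

def cyclic_convolution(a, b):
    """Cyclic convolution as the inverse DFT of the pointwise product of two DFTs."""
    n = len(a)
    w = root_of_unity(n)
    prod = [x * y % P for x, y in zip(dft(a, w), dft(b, w))]
    n_inv = pow(n, P - 2, P)
    w_inv = pow(w, P - 2, P)
    return [x * n_inv % P for x in dft(prod, w_inv)]
```

Because the arithmetic is exact in $\mathbb{F}_p$, the result matches the directly computed convolution with no rounding step.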

Here, we use our freedom to fix $p$, and choose $p = 2^{61} - 1$.† Notice that

$$2^{61} - 2 = 2 \times 3^2 \times 5^2 \times 7 \times 11 \times 13 \times 31 \times 41 \times 61 \times 151 \times 331 \times 1321,$$

and so there are many such divisors $N_i$ to choose from when working over $\mathbb{F}_p$. If $2\sqrt{n}$ is not
equal to a factor of $p - 1$, we can simply pad the vectors $b_y$ and $g$ such that their lengths are factors of
$2^{61} - 2$. Since $2^{61} - 2$ has many small factors, we never have to use too much padding: we calculated that we never need
to pad any sequence of length 100 < N < 10^ (good for n up to 10^®) by more than 16% of its length. This 
is better than the Cooley-Tukey method, where padding can double the length of the sequence. 

As an example, we can work with the length $N = 2 \times 5 \times 7 \times 9 \times 11 \times 13 = 90090$, sufficient
for inputs of size $n = (N/2)^2$, which is over $10^9$. The cost scales as $(2 + 5 + 7 + 9 + 11 + 13)N = 47N$.
Therefore, the PFA approach offers a substantial improvement over naive convolution in $\mathbb{F}_p$, which takes
time $O(N^2)$.
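The padding rule and the cost estimate can be reproduced with a few lines of arithmetic. The sketch below enumerates the divisors of $2^{61} - 2$ from its prime factorization, picks the smallest admissible length for a given sequence, and sums the pairwise-coprime prime-power factors to obtain the multiplier in the PFA cost. The helper names are hypothetical and the code makes no claim about how the authors performed their calculation.

```python
import math

# Prime factorization of 2^61 - 2, as exponent map.
FACTORIZATION = {2: 1, 3: 2, 5: 2, 7: 1, 11: 1, 13: 1,
                 31: 1, 41: 1, 61: 1, 151: 1, 331: 1, 1321: 1}
assert math.prod(q**e for q, e in FACTORIZATION.items()) == 2**61 - 2

def all_divisors():
    """All divisors of 2^61 - 2, in increasing order."""
    divs = [1]
    for q, e in FACTORIZATION.items():
        divs = [d * q**k for d in divs for k in range(e + 1)]
    return sorted(divs)

DIVISORS = all_divisors()

def padded_length(length):
    """Smallest divisor of 2^61 - 2 that is at least `length`."""
    return next(d for d in DIVISORS if d >= length)

def pfa_cost_factor(length):
    """Sum of the pairwise-coprime prime-power factors N_i of `length`."""
    total = 0
    for q in FACTORIZATION:
        power = 1
        while length % q == 0:
            length //= q
            power *= q
        if power > 1:
            total += power
    return total
```

With these helpers, `padded_length(90090)` returns 90090 itself and `pfa_cost_factor(90090)` returns 47, matching the worked example.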

Parallelization. This protocol is highly amenable to parallelization. Observe that $\mathcal{P}$ performs
$O(\sqrt{n})$ independent convolutions, each of length $O(\sqrt{n})$ (one for each column $y$ of the matrix
$a_{x,y}$), followed by computing $\sum_y f(x, y)^2$ for each row $x$ of the result. The convolutions can be done
in parallel, and once complete, the sum of squares of each row can also be parallelized. This protocol also
admits a simple two-round MapReduce implementation. In the first round, we assign each column $y$ of the matrix
$a_{x,y}$ a unique key, and have each reducer perform the convolution for the corresponding column. In the second
round, we assign each row $x$ a unique key, and have each reducer compute $\sum_y f(x, y)^2$ for its row $x$.

2.2 Implications 

As we experimentally demonstrate in Section 5, the results of this section make practical the fundamental 
building block for the majority of known non-interactive protocols. Indeed, by combining Theorem 2.1 with 
protocols from [8, 15], we obtain the following immediate corollaries. For all graph problems considered, n 
is the number of nodes in the graph, and m is the number of edges. 

† Arithmetic in this field can also be done quickly, see Section 5.1.

Corollary 2.2

1. (Extending [8, Theorem 4.3]) For any $h \cdot v \ge n$, there is an $(h, v)$ protocol for computing the inner
product and Hamming distance of two $n$-dimensional vectors, where $\mathcal{V}$ runs in time $O(n)$ and
$\mathcal{P}$ runs in time $O(n \log n)$. The previous best runtime known for $\mathcal{P}$ was $O(h^2 v)$.

2. (Extending [15, Theorem 4]) For any $h \cdot v \ge n$, there is an $(mh, v)$ protocol for $m \times n$ integer
matrix-vector multiplication (MVMULT), where $\mathcal{V}$ runs in time $O(mn)$ and $\mathcal{P}$ runs in time
$O(mn \log n)$. The best runtime known for $\mathcal{P}$ previously was $O(m h^2 v)$.
3. (Extending [15, Corollary 3]) For any $h \cdot v \ge n$, there is an $(nh, v)$ protocol for solving a linear
program over $n$ variables with $n$ (integer) constraints and subdeterminants of polynomial magnitude,
where $\mathcal{V}$ runs in time $O(n^2)$ and $\mathcal{P}$ runs in time $O(t(n) + n^2 \log n)$, where $t(n)$ is the
time required to solve the linear program and its dual. The best runtime known for $\mathcal{P}$ previously was
$O(t(n) + n h^2 v)$.

4. (Extending [8, Theorem 5.4]) For any $h \cdot v \ge n^3$, there is an $(h, v)$ protocol for counting the number
of triangles in a graph, where $\mathcal{V}$ runs in time $O(mn)$ and $\mathcal{P}$ runs in time $O(n^3 \log n)$.
The best runtime known for $\mathcal{P}$ previously was $O(h^2 v)$.

5. (Extending [9, Theorem 6.6]) For any $h \cdot v \ge n^2$, $h \ge n$, there is an $(h, v)$ protocol for graph
connectivity, where $\mathcal{P}$ runs in time $O(n^2 \log n)$ and $\mathcal{V}$ runs in time $O(m)$. The best
runtime known for $\mathcal{P}$ previously was $O(n h^2 v)$.

6. (Extending [9, Theorem 6.5]) For any $h \cdot v \ge n^2$, $h \ge n$, there is an $(h, v)$ protocol for bipartite
perfect matching, where $\mathcal{V}$ runs in time $O(m)$ and $\mathcal{P}$ runs in time $O(t(n) + n^2 \log n)$,
where $t(n)$ is the time required to find a perfect matching if one exists, or to find a counter-example (via
Hall's Theorem) otherwise. The best runtime known for $\mathcal{P}$ previously was $O(t(n) + h^2 v)$.

In the common case where we choose $h = v$, this represents a polynomial speed-up in $\mathcal{P}$'s runtime.
For example, for the MVMULT problem, the prover's cost is reduced from $O(mn^{3/2})$ in prior work to
$O(mn \log n)$.

In most cases of Corollary 2.2, $\mathcal{V}$ runs in linear time, and $\mathcal{P}$ runs in nearly linear time for
dense inputs, plus the time $t(n)$ required to solve the problem in the first place, which may be superlinear.
Thus, $\mathcal{P}$ pays at most a logarithmic factor overhead in solving the problem "verifiably", compared to
solving the problem in a non-verifiable manner.

3 A General Approach: Multi-round Protocols Via Circuit Checking 

In this section, we study interactive protocols, and describe how to efficiently instantiate the powerful
framework due to Goldwasser, Kalai, and Rothblum for verifying arbitrary computations.†

A standard approach to verified computation developed in the theoretical literature is to verify properties
of circuits that compute the desired function [18, 19, 31]. One of the most promising of these is due to
Goldwasser et al., which proves the following result:

Theorem 3.1 [19] Let $f$ be a function over an arbitrary field $\mathbb{F}$ that can be computed by a family of
$O(\log S(n))$-space uniform arithmetic circuits (over $\mathbb{F}$) of fan-in 2, size $S(n)$, and depth $d(n)$.
Then, assuming unit cost for transmitting or storing a value in $\mathbb{F}$, $f$ possesses a
$(\log S(n), d(n) \log S(n))$-protocol requiring $O(d(n) \log S(n))$ rounds. $\mathcal{V}$ runs in time
$(n + d(n)) \, \mathrm{polylog}(S(n))$ and $\mathcal{P}$ runs in time $\mathrm{poly}(S(n))$.

Here, an arithmetic circuit over a field $\mathbb{F}$ is analogous to a boolean circuit, except that the inputs are
elements of $\mathbb{F}$ rather than boolean values, and the gates of the circuit compute addition and
multiplication over $\mathbb{F}$. We address how to realize the protocol of Theorem 3.1 efficiently. Specifically,
we show three technical results. The first two results, Theorems 3.2 and 3.3, state that for any log-space
uniform circuit, the honest prover in the protocol of Theorem 3.1 can be made to run in time nearly linear in the
size of the circuit, with a streaming verifier who uses only $O(\log S(n))$ words of memory. Thus, these results
guarantee a highly efficient prover and a space-efficient verifier. In streaming contexts, where $\mathcal{V}$ is
more space-constrained than time-constrained, this may be acceptable. Moreover, Theorem 3.3 states that
$\mathcal{V}$ can perform the time-consuming part of its computation in a data-independent non-interactive
preprocessing phase, which can occur offline before the stream is observed.

† We are indebted to these authors for sharing their working draft of the full version of [19], which provides
much greater detail than is possible in the conference presentation.

Our third result, Theorem 3.4, makes a slightly stronger assumption but yields a stronger result: it states
that under very mild conditions on the circuit, we can achieve a prover who runs in time nearly linear in the
size of the circuit, and a verifier who is both space- and time-efficient.

Before stating our theorems, we sketch the main techniques needed to achieve the efficient implemen- 
tation, with full details in Appendix A. We also direct the interested reader to the source code of our 
implementations [16]. The remainder of this section is intended to be reasonably accessible to readers who 
are familiar with the sum-check protocol [33, 25], but not necessarily with the protocol of [19]. 

3.1 Engineering an Efficient Prover 

In the protocol of [19], $\mathcal{P}$ and $\mathcal{V}$ first agree on a depth $d$ circuit $C$ of gates with fan-in
2 that computes the function of interest; $C$ is assumed to be in layered form (this assumption blows up the size
of the circuit by at most a factor of $d$, and we argue that it is unrestrictive in practice, as the natural
circuits for all four of our motivating problems are layered, as well as for a variety of other problems
described in Appendix A). $\mathcal{P}$ begins by claiming a value for the output gate of the circuit. The
protocol then proceeds iteratively from the output layer of $C$ to the input layer, with one iteration for each
layer. For presentation purposes, assume that all layers of the circuit have $n$ gates, and let $v = \log n$.

At a high level, in iteration 1, $\mathcal{V}$ reduces verifying the claimed value of the output gate to computing
the value of a certain $3v$-variate polynomial $f_1$ at a random point $r^{(1)} \in \mathbb{F}^{3v}$. The
iterations then proceed inductively over each layer of gates: in iteration $i > 1$, $\mathcal{V}$ reduces
computing $f_{i-1}(r^{(i-1)})$ for a certain $3v$-variate polynomial $f_{i-1}$ to computing $f_i(r^{(i)})$ for a
random point $r^{(i)} \in \mathbb{F}^{3v}$. Finally, in iteration $d$, $\mathcal{V}$ must compute
$f_d(r^{(d)})$. This happens to be a function of the input alone (specifically, it is an evaluation of a
low-degree extension of the input), and $\mathcal{V}$ can compute this value in a streaming fashion, without
assistance, even if only given access to the raw (unaggregated) data stream, as described in Section 1.2. If
the values agree, then $\mathcal{V}$ is convinced of the correctness of the output.

We abstract the notion of a "wiring predicate", which encodes which pairs of wires from layer $i - 1$ are
connected to a given gate at layer $i$. Each iteration $i$ consists of an application of the standard sum-check
protocol [25, 33] to a $3v$-variate polynomial $f_i$ based on the wiring predicate. There is some flexibility
in choosing the specific polynomial $f_i$ to use. This is because the definition of $f_i$ involves a low-degree
extension of the circuit's wiring predicate, and there are many such low-degree extensions to choose from.

A polynomial is said to be multilinear if it has degree at most one in each variable. The results in this
section rely critically on the observation that the honest prover's computation in the protocol of [19] can
be greatly simplified if we use the multilinear extension of the circuit's wiring predicate.† Details of this
observation follow.

† There are other reasons why using the multilinear extension is desirable. For example, the communication cost
of the protocol is proportional to the degree of the extension used.

As already mentioned, at iteration $i$ of the protocol of [19], the sum-check protocol is applied to the
$3v$-variate polynomial $f_i$. In the $j$'th round of this sum-check protocol, $\mathcal{P}$ is required to send
the univariate polynomial

$$g_j(X_j) = \sum_{(x_{j+1}, \ldots, x_{3v}) \in \{0,1\}^{3v-j}} f_i(r_1, \ldots, r_{j-1}, X_j, x_{j+1}, \ldots, x_{3v}),$$

where $r_1, \ldots, r_{j-1}$ are the random values chosen by $\mathcal{V}$ in the previous rounds.

The sum defining $g_j$ involves as many as $n^3$ terms, and thus a naive implementation of $\mathcal{P}$ would
require $\Omega(n^3)$ time per iteration of the protocol. However, we observe that if the multilinear extension of
the circuit's wiring predicate is used in the definition of $f_i$, then each gate at layer $i - 1$ contributes to
exactly one term in the sum defining $g_j$, as does each gate at layer $i$. Thus, the polynomial $g_j$ can be
computed with a single pass over the gates at layer $i - 1$, and a single pass over the gates at layer $i$. As the
sum-check protocol requires $O(v) = O(\log S(n))$ messages for each layer of the circuit, $\mathcal{P}$ requires
logarithmically many passes over each layer of the circuit in total.
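The "one pass per round" structure is easiest to see in a simplified setting. The sketch below runs the sum-check protocol on a plain multilinear polynomial given by its table of values on the Boolean cube, which is an assumption made for illustration: it is not the polynomial $f_i$ built from the wiring predicate. In each round the honest prover makes one pass to form its message and one pass to fix the next variable. All names are hypothetical.

```python
import random

P = 2**61 - 1

def mle_eval(table, r):
    """Evaluate the multilinear extension of `table` at the point r
    (first variable = most significant index bit)."""
    cur = [t % P for t in table]
    for rk in r:
        half = len(cur) // 2
        cur = [(cur[k] + rk * (cur[half + k] - cur[k])) % P for k in range(half)]
    return cur[0]

def sumcheck(table, claimed_sum, rng):
    """Return True iff the verifier accepts `claimed_sum` as the sum of `table`.
    len(table) must be a power of two; the prover here is honest about `table`."""
    cur = [t % P for t in table]
    claim = claimed_sum % P
    r = []
    while len(cur) > 1:
        half = len(cur) // 2
        g0 = sum(cur[:half]) % P            # prover message: g_j(0)
        g1 = sum(cur[half:]) % P            # prover message: g_j(1)
        if (g0 + g1) % P != claim:          # verifier's consistency check
            return False
        rj = rng.randrange(P)               # verifier's random challenge
        r.append(rj)
        claim = (g0 + rj * (g1 - g0)) % P   # g_j(rj); g_j is linear here
        # One pass over the table fixes the current variable to rj.
        cur = [(cur[k] + rj * (cur[half + k] - cur[k])) % P for k in range(half)]
    return claim == mle_eval(table, r)      # verifier's own final evaluation
```

The table halves each round, so the total work of this toy prover is linear in the table size; the protocol in the text needs a pass over each layer per round instead.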

A complication in applying the above observation is that $\mathcal{V}$ must process the circuit in order to pull
out information about its structure necessary to check the validity of $\mathcal{P}$'s messages. Specifically,
each application of the sum-check protocol requires $\mathcal{V}$ to evaluate the multilinear extension of the
wiring predicate of the circuit at a random point. Theorem 3.2 follows from the fact that for any log-space
uniform circuit, $\mathcal{V}$ can evaluate the multilinear extension of the wiring predicate at any point using
space $O(\log S(n))$. We present detailed proofs and discussions of the following theorems in Appendix A.

Theorem 3.2 For any log-space uniform circuit of size $S(n)$, $\mathcal{P}$ requires $O(S(n) \log S(n))$ time to
implement the protocol of Theorem 3.1 over the entire execution, and $\mathcal{V}$ requires space $O(\log S(n))$.

Moreover, because the circuit's wiring predicate is independent of the input, we can separate $\mathcal{V}$'s
computation into an offline non-interactive preprocessing phase, which occurs before the data stream is seen, and
an online interactive phase which occurs after both $\mathcal{P}$ and $\mathcal{V}$ have seen the input. This is
similar to [19, Theorem 4], and ensures that $\mathcal{V}$ is space-efficient (but may require time
$\mathrm{poly}(S(n))$) during the offline phase, and that $\mathcal{V}$ is both time- and space-efficient in the
online interactive phase. In order to determine which circuit to use, $\mathcal{V}$ does need to know (an upper
bound on) the length of the input during the preprocessing phase.

Theorem 3.3 For any log-space uniform circuit of size $S(n)$, $\mathcal{P}$ requires $O(S(n) \log S(n))$ time to
implement the protocol of Theorem 3.1 over the entire execution. $\mathcal{V}$ requires space
$O(d(n) \log S(n))$ and time $O(\mathrm{poly}(S(n)))$ in a non-interactive, data-independent preprocessing phase,
and only requires space $O(d(n) \log S(n))$ and time $O(n \log n + d(n) \log S(n))$ in an online interactive
phase, where the $O(n \log n)$ term is due to the time required to evaluate the low-degree extension of the input
at a point.

Finally, Theorem 3.4 follows by assuming $\mathcal{V}$ can evaluate the multilinear extension of the wiring
predicate quickly. A formal statement of Theorem 3.4 is in Appendix A. We believe that the hypothesis of
Theorem 3.4 is extremely mild, and we discuss this point at length in Appendix A, identifying a diverse
array of circuits to which Theorem 3.4 applies. Moreover, the solutions we adopt in our circuit-checking
experiments for $F_2$, $F_0$, and PMwW correspond to Theorem 3.4, and are both space- and time-efficient for
the verifier.

Theorem 3.4 (informal) Let $C$ be any log-space uniform circuit of size $S(n)$ and depth $d(n)$, and assume
there exists an $O(\log S(n))$-space, $\mathrm{poly}(\log S(n))$-time algorithm for evaluating the multilinear
extension of $C$'s wiring predicate at a point. Then in order to implement the protocol of Theorem 3.1
applied to $C$, $\mathcal{P}$ requires $O(S(n) \log S(n))$ time, and $\mathcal{V}$ requires space
$O(\log S(n))$ and time $O(n \log n + d(n) \, \mathrm{poly}(\log S(n)))$, where the $O(n \log n)$ term is due to
the time required to evaluate the low-degree extension of the input at a point.
3.2 Circuit Design Issues

The protocol of [19] is described for arithmetic circuits with addition ($+$) and multiplication ($\times$) gates.
This is sufficient to prove the power of this system, since any efficiently computable boolean function on
boolean inputs can be computed by an (asymptotically) small arithmetic circuit. Typically such arithmetic circuits
are obtained by constructing a boolean circuit (with AND, OR, and NOT gates) for the function, and then
"arithmetizing" the circuit [2, Chapter 8]. However, we strive not just for asymptotic efficiency, but genuine
practicality, and the factors involved can grow quite quickly: every layer of (arithmetic) gates in the circuit
adds $3v$ rounds of interaction to the protocol. Hence, we further explore optimizations and implementation
issues.

Extended Gates. The circuit checking protocol of [19] can be extended with any gates that compute low-degree
polynomial functions of their inputs. If $g$ is a polynomial of degree $j$, we can use gates computing
$g(x)$; this increases the communication complexity in each round of the protocol by at most $j - 2$ words, as
$\mathcal{P}$ must send a degree-$j$ polynomial, rather than a degree-2 polynomial.

The low-depth circuits we use to compute functions of interest (specifically, $F_0$ and PMwW) make use
of the function $f(x) = x^{p-1}$. Using only $+$ and $\times$ gates, they require depth about $\log_2 p$. If we
also use gates computing $g(x, y) = x^j y^j$ for a small $j$, we can reduce the depth of the circuits to about
$\log_{2j} p$; as the number of rounds in the protocol of [19] depends linearly on the depth of the circuit, this
reduces the number of rounds by a factor of about $\log_2 p / \log_{2j} p = 1/\log_{2j} 2 = \log_2 (2j)$. At the
same time this increases the communication cost of each round by a factor of (at most) $j - 2$. We can optimize
the choice of $j$. In our experiments, we use $j = 4$ (so $g(x, x)$ is $x^8$) and $j = 8$ ($g(x, x) = x^{16}$) to
simultaneously reduce the number of messages by a factor of 3, and the communication cost and prover runtime by
significant factors as well.
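The depth estimates above are simple logarithms, and a short calculation shows the scale involved. This sketch assumes $p = 2^{61} - 1$ and treats $j = 1$ as the ordinary multiplication gate $g(x, y) = xy$; the function names are hypothetical.

```python
import math

def depth_estimate(p, j):
    """Approximate depth log_{2j}(p) of a circuit computing x^(p-1)
    from gates g(x, y) = x^j * y^j."""
    return math.log2(p) / math.log2(2 * j)

def round_reduction(j):
    """Factor by which the number of rounds shrinks: log_2(2j)."""
    return math.log2(2 * j)

p = 2**61 - 1
for j in (1, 4, 8):
    print(j, round(depth_estimate(p, j), 2), round_reduction(j))
```

Under these assumptions the depth drops from about 61 layers for plain multiplication gates to roughly 20 for $j = 4$ and roughly 15 for $j = 8$.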

Another optimization is possible. All four specific problems we consider, $F_2$, $F_0$, PMwW, and MVMULT,
eventually compute the sum of a large number of values. Let $f$ be the low-degree extension of the
values being summed. For functions of this form, $\mathcal{V}$ can use a single sum-check protocol
[2, Chapter 8] to reduce the computation of the sum to computing $f(r)$ for a random point $r$. $\mathcal{V}$ can
then use the protocol of [19] to delegate computation of $f(r)$ to $\mathcal{P}$. Conceptually, this optimization
corresponds to replacing a binary tree of addition gates in an arithmetic circuit $C$ with a single $\oplus$ gate
with large fan-in, which sums all its inputs. This optimization can reduce the communication cost and the number
of messages required by the protocol.

General Circuit Design. The circuit checking approach can be combined with existing compilers, such
as that in the Fairplay system [26], that take as input a program in a high-level programming language and
output a corresponding boolean circuit. This boolean circuit can then be arithmetized and "verified" by our
implementation; this yields a full-fledged system implementing statistically-secure verifiable computation.
However, this system is likely to remain impractical even though the prover $\mathcal{P}$ can be made to run in
time nearly linear in the size of the arithmetic circuit. For example, in most hardware, one can compute the sum
of two 32-bit integers $x$ and $y$ with a single instruction. However, when encoding this operation into a boolean
circuit, it is unclear how to do this with depth less than 32. At $3 \log n$ rounds per circuit layer, for
reasonable parameters, single additions can turn into thousands of rounds.

The protocols in Section 3.3 avoid this by avoiding boolean circuits, and instead view the input directly
as elements over $\mathbb{F}_p$. For example, if the input is an array of 32-bit integers, then we view each
element of the array as a value of $\mathbb{F}_p$, and calculating the sum of two integers requires a single
depth-1 addition gate, rather than a depth-32 boolean circuit. However, this approach seems to severely limit the
functionality that can be implemented. For instance, we know of no compact arithmetic circuit to test whether
$x > y$ when viewing $x$ and $y$ as elements of $\mathbb{F}_p$. Indeed, if such a circuit for this function
existed, we would obtain substantially improved protocols for $F_0$ and PMwW.

This polylogarithmic blowup in circuit depth compared to input size appears inherent in any construction 
that encodes computations as arithmetic circuits. Therefore, the development of general purpose protocols 
that avoid this representation remains an important direction for future work. 

3.3 Efficient Protocols For Specific Problems 

We obtain interactive protocols for our problems of interest by applying Theorem 3.1 to carefully chosen
arithmetic circuits. These are circuits where each gate executes a simple arithmetic operation on its inputs,
such as addition, subtraction, or multiplication. For the first three problems, there exist specialized
protocols; our purpose in describing these protocols here is to explore how the general construction performs
when applied to specific functions of high interest. However, for PMwW, the protocol we describe here is the
first of its kind.

For each problem, we describe a circuit which exploits the arithmetic structure of the finite field over 
which they are defined. For the latter three problems, this involves an interesting use of Fermat's Little 
Theorem. These circuits lend themselves to extensions of the basic protocol of [19] that achieve quantitative 
improvements in all costs; we demonstrate the extent of these improvements in Section 5. 

Protocol for $F_2$: The arithmetic circuit for $F_2$ is quite straightforward: the first level computes the square
of input values, then subsequent levels sum these up pairwise to obtain the sum of all squared values. The total
depth $d$ is $O(\log n)$. This implies an $O(\log^2 n)$ message $(\log^2 n, \log^2 n)$ protocol (as per
Definition 1.2).

Protocol for $F_0$: We describe a succinct arithmetic circuit over $\mathbb{F}_p$ that computes $F_0$. When $p$ is a
prime larger than $n$, Fermat's Little Theorem (FLT) implies that for $x \in \mathbb{F}_p$, $x^{p-1} = 1$ if and
only if $x \neq 0$. Consider the circuit that, for each coordinate $i$ of the input vector $a$, computes each
$a_i^{p-1}$ via $O(\log p)$ multiplications, and then sums the results. This circuit has total size
$O(n \log p)$ and depth $O(\log p)$. Applying the protocol of [19] to this circuit, we obtain a
$(\log n \log p, \log n)$ protocol where $\mathcal{P}$ runs in time $O(n \log n \log p)$.
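The computation performed by this circuit can be mirrored directly in code. The sketch below assumes $p = 2^{61} - 1$ and a frequency vector that has already been aggregated from the stream; it uses square-and-multiply to raise each entry to the power $p - 1$ and counts multiplications to illustrate the $O(\log p)$ bound. The function names are hypothetical.

```python
P = 2**61 - 1

def power_with_count(x, e):
    """Square-and-multiply; returns (x^e mod P, number of multiplications used)."""
    result, base, mults = 1, x % P, 0
    while e > 0:
        if e & 1:
            result = result * base % P
            mults += 1
        e >>= 1
        if e:
            base = base * base % P
            mults += 1
    return result, mults

def f0(freq):
    """Number of non-zero entries of freq, computed as sum_i freq[i]^(P-1) mod P."""
    return sum(power_with_count(a, P - 1)[0] for a in freq) % P
```

For the stream 3, 1, 3, 7, 1, 3 over a universe of size 8, the frequency vector is `[0, 2, 0, 3, 0, 0, 0, 1]` and `f0` returns 3.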

Protocol for MVMult: The first level of the circuit computes $A_{ij} x_j$ for all $i, j$, and subsequent levels sum
these to obtain $\sum_j A_{ij} x_j$. Then we use FLT to ensure that $\sum_j A_{ij} x_j = b_i$ for all $i$, via

$$\sum_i \Big( b_i - \sum_j A_{ij} x_j \Big)^{p-1}.$$

The input is as claimed if this final output of the circuit is 0 (i.e. it counts the number of entries of $b$
that are incorrect). This circuit has depth $O(\log p)$ and size $O(n^2 \log p)$, and we therefore obtain an
$(n + \log p \log n, \log n)$ protocol requiring $O(\log p \log n)$ rounds, where $\mathcal{P}$ runs in time
$O(n^2 \log p \log n)$.
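The same FLT trick turns an equality test into a count. A minimal sketch, assuming $p = 2^{61} - 1$ and small integer inputs; the function name is hypothetical.

```python
P = 2**61 - 1

def count_wrong_entries(A, x, b):
    """Number of rows i with (A x)_i != b_i, computed as
    sum_i (b_i - sum_j A[i][j] * x[j])^(P-1) mod P."""
    total = 0
    for row, b_i in zip(A, b):
        diff = (b_i - sum(a_ij * x_j for a_ij, x_j in zip(row, x))) % P
        total += pow(diff, P - 1, P)   # 1 if diff is non-zero in F_p, else 0
    return total % P
```

A result of 0 means the claimed product vector `b` is correct in every coordinate.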

Protocol for PMwW: To handle wildcards in both $T$ (of length $n$) and $P$ (of length $q$), we replace each
occurrence of the wildcard symbol with 0; [13] notes that the pattern occurs at location $i$ of $T$ if and only if

$$I_i := \sum_{j=0}^{q-1} T_{i+j} P_j (T_{i+j} - P_j)^2 = 0.$$

Thus, by FLT, it suffices to compute $\sum_i I_i^{p-1}$, which can be done naively by an arithmetic circuit of size
$O(nq + n \log p)$ and depth $O(\log p + \log q)$. We obtain a $(\log n \log p, \log n)$ protocol where
$\mathcal{P}$ runs in time $O(n \log n (q + \log p))$.
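The wildcard-matching condition of [13] and the FLT-based count can be sketched as follows. The code assumes symbols are encoded as positive integers with 0 as the wildcard, and that $p = 2^{61} - 1$ is large enough that a non-zero integer sum stays non-zero modulo $p$; the names `mismatch_indicators` and `count_matches` are hypothetical.

```python
P = 2**61 - 1

def mismatch_indicators(text, pattern):
    """I_i = sum_j T[i+j] * P[j] * (T[i+j] - P[j])^2 mod p for each alignment i.
    I_i is 0 exactly when the pattern matches at position i (0 = wildcard)."""
    n, q = len(text), len(pattern)
    return [
        sum(text[i + j] * pattern[j] * (text[i + j] - pattern[j]) ** 2
            for j in range(q)) % P
        for i in range(n - q + 1)
    ]

def count_matches(text, pattern):
    """Alignments minus sum_i I_i^(p-1), i.e. the number of matching positions."""
    indicators = mismatch_indicators(text, pattern)
    non_matches = sum(pow(v, P - 1, P) for v in indicators) % P
    return len(indicators) - non_matches
```

For the text `[1, 2, 0, 1, 2, 3]` and the pattern `[1, 0, 3]`, the pattern matches at positions 0 and 3.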





For large $q$, a further optimization is possible: the vector $I$ can be written as the sum of a constant
number of circular convolutions. Such convolutions can be computed efficiently using Fourier techniques
in time $O(n \log q)$ and, importantly, appropriate FFT and inverse FFT operations can be implemented via
arithmetic circuits. Thus, for $q$ larger than $\log p$, we can reduce the circuit size (and hence
$\mathcal{P}$'s runtime) in this way, rather than by naively computing each entry of $I$ independently.

4 Multi-Round Protocols via Linearization 

In this section, we show how the technique of linearization can improve upon the general approach of
Section 2.1 for some important functions. Specifically, this technique can be applied to multi-round protocols
which would otherwise require polynomials of very high degree to be communicated. We show this in the
context of new multi-round protocols for $F_0$ and PMwW, and we later empirically observe that our new
protocol achieves a speed-up of two orders of magnitude over existing protocols for $F_0$, as well as an order
of magnitude improvement in communication cost.

Existing approaches for $F_0$ in the multi-round setting are based on generalizations of the multi-round
protocol for $F_2$ [17]. As described in [17], directly applying this approach is problematic: the central
function in $F_0$ maps non-zero frequencies to 1 while keeping zero frequencies as zero. Expressed as a
polynomial, this function has degree $m$ (an upper bound on the frequency of any item), which translates into
a factor of $m$ in the communication required and the time cost of $\mathcal{P}$. However, this cost can be
reduced to $F_\infty$, where $F_\infty$ denotes the maximum number of times any item appears in the stream.
Further, if both $\mathcal{P}$ and $\mathcal{V}$ keep a buffer of $h$ input items, they can eliminate duplicate
items within the buffer, and so ensure that $F_\infty \le m/h$. This leads to an $O(\log n)$ message,
$(\log n, \log n)$ multi-round protocol with $\mathcal{P}$'s runtime being $O(F_\infty^2 n \log n)$ [17]. This
protocol trades off increased communication for a quadratic improvement in the number of rounds of communication
required compared to the protocol outlined in Section 3.3 above.

4.1 Linearization Set-up 

In this section we describe a new multi-round protocol for $F_0$, and later explain how it can be modified for
PMwW. This protocol has similar asymptotic costs to that obtained in Section 3.3, but in practice achieves
close to two orders of magnitude improvement in $\mathcal{P}$'s runtime. The core idea is to represent the data
as a large binary vector indicating when each item occurs in the stream. The protocol simulates progressively
merging time ranges together to indicate which items occurred within the ranges. Directly verifying this
computation would hit the same roadblock indicated above: using polynomials to check this would result in
polynomials of high degree, dominating the cost. So we use a "linearization" technique, which ensures that
the degree of the polynomials required stays low, at the cost of more rounds of interaction. This uses ideas
of Shen [34] as presented in [2, Chapter 8].

As usual, we work over a finite field with $p$ elements, $\mathbb{F}_p$. The input implicitly defines an
$n \times m$ matrix $A$ such that $A_{i,j} = 1$ if the $j$'th item of the stream equals $i$, and $A_{i,j} = 0$
otherwise.

Working over the Boolean Hypercube. A key first step is to define an indexing structure based on the
$d$-dimensional Boolean hypercube, so every input point is indexed by a $d$ bit binary string, which is the
(binary) concatenation of a $\log n$ bit string $i$ and a $\log m$ bit string $j$. We view $A$ as a function from
$\{0, 1\}^d$ to $\{0, 1\}$ via $(x_1, \ldots, x_d) \mapsto A_{(x_1, \ldots, x_d)}$. Let $f$ be the unique
multilinear polynomial in $d$ variables such that $f(x_1, \ldots, x_d) = A_{(x_1, \ldots, x_d)}$ for all
$(x_1, \ldots, x_d) \in \{0, 1\}^d$, i.e. $f$ is the multilinear extension of the function on $\{0, 1\}^d$
implied by $A$.
The only information that the verifier $\mathcal{V}$ needs to keep track of is the value of $f$ at a random point.
That is, $\mathcal{V}$ chooses a random vector $r = (r_1, \ldots, r_d) \in \mathbb{F}_p^d$. It is efficient for
$\mathcal{V}$ to compute $f(r)$ as $\mathcal{V}$ observes the stream which defines $A$ (and hence $f$): when the
$j$'th update is item $i$, this translates to the vector $v = (i, j) \in \{0, 1\}^d$. The necessary update is of
the form $f(r) \leftarrow f(r) + \chi_v(r)$, where $\chi_v$ is the unique multilinear polynomial that is 1 at $v$
and 0 everywhere else in $\{0, 1\}^d$. For this, $\mathcal{V}$ only needs to store $r$ and the current value of
$f(r)$.
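The verifier's streaming update is short enough to write out in full. The sketch below assumes $p = 2^{61} - 1$, most-significant-bit-first encodings of $i$ and $j$, and the closed form $\chi_v(r) = \prod_k (v_k r_k + (1 - v_k)(1 - r_k))$ for the multilinear indicator of $v$. The class and helper names are hypothetical.

```python
import random

P = 2**61 - 1

def bits(value, width):
    """Binary expansion of value, most significant bit first."""
    return [(value >> (width - 1 - k)) & 1 for k in range(width)]

def chi(v_bits, r):
    """chi_v(r) = prod_k (r_k if v_k = 1 else 1 - r_k), computed mod P."""
    out = 1
    for vk, rk in zip(v_bits, r):
        out = out * (rk if vk else (1 - rk)) % P
    return out

class StreamingVerifier:
    """Maintains f(r) for the multilinear extension f of the n x m indicator matrix A."""

    def __init__(self, log_n, log_m, rng):
        self.log_n, self.log_m = log_n, log_m
        self.r = [rng.randrange(P) for _ in range(log_n + log_m)]
        self.value = 0      # current value of f(r)
        self.position = 0   # index j of the next stream item

    def update(self, item):
        v = bits(item, self.log_n) + bits(self.position, self.log_m)
        self.value = (self.value + chi(v, self.r)) % P
        self.position += 1
```

The state is just `r`, one field element, and a counter, so the space used is $O(d)$ words regardless of the stream length.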

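Linearization and Arithmetized Boolean Operators. We use three operators $\coprod$, $\prod$ and $L$ on
polynomials $g$, defined as follows:

$$\coprod\nolimits_k g(X_1, \ldots, X_k) = g(X_1, \ldots, X_{k-1}, 0) + g(X_1, \ldots, X_{k-1}, 1) - g(X_1, \ldots, X_{k-1}, 0) \cdot g(X_1, \ldots, X_{k-1}, 1).$$

$$\prod\nolimits_k g(X_1, \ldots, X_k) = g(X_1, \ldots, X_{k-1}, 0) \cdot g(X_1, \ldots, X_{k-1}, 1).$$

$$L_i \, g(X_1, \ldots, X_k) = X_i \cdot g(X_1, \ldots, X_{i-1}, 1, X_{i+1}, \ldots, X_k) + (1 - X_i) \cdot g(X_1, \ldots, X_{i-1}, 0, X_{i+1}, \ldots, X_k).$$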

$\coprod$ and $\prod$ generalize the familiar "OR" and "AND" operators, respectively. Thus, if $g$ is a
$k$-variate polynomial of degree at most $j$ in each variable, $\coprod_k(g)$ and $\prod_k(g)$ are
$(k-1)$-variate polynomials of degree at most $2j$ in each variable. They generalize Boolean operators in the
sense that if $g(X_1, \ldots, X_{k-1}, 0) = x$ and $g(X_1, \ldots, X_{k-1}, 1) = y$, and $x, y$ are both 0 or 1,
then

$(\coprod_k g)(X_1, \ldots, X_{k-1}) = 1$ iff $x = 1$ or $y = 1$,
and $(\prod_k g)(X_1, \ldots, X_{k-1}) = 1$ iff $x = 1$ and $y = 1$.

$L$ is a linearization operator. If $g$ is a $k$-variate polynomial, $L_i(g)$ is a $k$-variate polynomial that is
linear in variable $X_i$. $L_i$ operations are used to control the degree of the polynomials that arise throughout
the execution of our protocol. Since $x^j = x$ for all $j \ge 1$, $x \in \{0, 1\}$, $L_i(g)$ agrees with
$g(\cdot)$ on all values in $\{0, 1\}^k$.

Throughout, when applying a sequence of operations to a polynomial to obtain a new one, the operations
are applied "right-to-left". For example, we write the $(k-1)$-variate polynomial

$L_1(L_2(\ldots L_{k-1}(\coprod_k g) \ldots))$ as $L_1 L_2 \ldots L_{k-1} \coprod_k g$.
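The three operators can be modelled as higher-order functions acting on polynomials represented as Python callables over $\mathbb{F}_p$. This is a sketch for intuition only, assuming $p = 2^{61} - 1$; a real prover would manipulate tables of evaluations rather than closures, and the names `or_last`, `and_last` and `linearize` are hypothetical.

```python
P = 2**61 - 1

def or_last(g):
    """Arithmetized OR over the last variable of g."""
    return lambda *xs: (g(*xs, 0) + g(*xs, 1) - g(*xs, 0) * g(*xs, 1)) % P

def and_last(g):
    """Arithmetized AND over the last variable of g."""
    return lambda *xs: g(*xs, 0) * g(*xs, 1) % P

def linearize(g, i):
    """L_i: make g linear in its i'th variable (1-indexed) without changing
    its values on Boolean inputs."""
    def out(*xs):
        at_one = g(*xs[:i - 1], 1, *xs[i:])
        at_zero = g(*xs[:i - 1], 0, *xs[i:])
        return (xs[i - 1] * at_one + (1 - xs[i - 1]) * at_zero) % P
    return out
```

For example, with `g = lambda x1, x2: (x1**3 * x2 + x2) % P`, the polynomial `linearize(g, 1)` agrees with `g` on all Boolean inputs but has degree one in its first variable.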

Rewriting $F_0$ and PMwW. For $F_2$ and MVMult there is little need for linearization: the polynomials
generated remain of low degree, so the multi-round protocols described in [17, 15] already suffice. But
linearization can help with $F_0$ and PMwW.

Thinking of the input as a matrix $A$ as defined above, we can compute $F_0$ by repeatedly taking the
columnwise-OR of adjacent column pairs to end up with a vector which indicates whether item $i$ appeared
in the stream, then repeatedly summing adjacent entries to get the number of distinct elements. When
representing these operations as polynomials, we make additional use of linearization operations to control
the degree of the polynomials that arise. Using the properties of the operations $\coprod$ and $L_j$ described
above and rewriting in terms of the hypercube, it can be seen that

$$F_0(a) = \sum_{x_1=0}^{1} \cdots \sum_{x_k=0}^{1} L_k L_{k-1} \cdots L_1 \coprod\nolimits_{k+1} L_{k+1} L_k \cdots L_1 \coprod\nolimits_{k+2} \cdots L_{d-1} L_{d-2} \cdots L_1 \coprod\nolimits_d f \qquad (2)$$

with $k = \log n$, because this expression only involves variables and values in $\{0, 1\}$. The size of this
expression is $O(d^2) = O(\log^2 n)$.
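On Boolean inputs the expression in (2) reduces to a very simple computation, since the linearization operators leave Boolean values unchanged. The sketch below carries out that underlying computation directly on the indicator matrix: it merges adjacent columns with the arithmetized OR until one column remains, then sums the column. It assumes the stream length is a power of two and $p = 2^{61} - 1$; the function names are hypothetical.

```python
P = 2**61 - 1

def indicator_matrix(stream, n):
    """A[i][j] = 1 if the j'th stream item equals i, else 0."""
    return [[1 if item == i else 0 for item in stream] for i in range(n)]

def f0_by_merging(stream, n):
    """Number of distinct items, via repeated column-wise OR of adjacent columns."""
    rows = indicator_matrix(stream, n)
    while len(rows[0]) > 1:
        rows = [
            [(row[c] + row[c + 1] - row[c] * row[c + 1]) % P
             for c in range(0, len(row), 2)]
            for row in rows
        ]
    return sum(row[0] for row in rows) % P
```

Each merging pass corresponds to one $\coprod$ operator in (2), and the final sum corresponds to the leading summations over the item index.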

The case for PMwW is similar. Assume for now that the pattern length $q$ is a power of two (if not, it can
be padded with trailing wildcards). We now consider the input to define a matrix $A$ of size $2n \times qn$, such
that $A_{2i,\, qj+k(q-1)} = 1$ if the $j$'th item of the stream equals $i$, for all $0 \le k \le q - 1$, and
$A_{2i-1,\, qj+2k} = 1$ if the $k$'th character of the pattern equals $i$, for all $0 \le j \le n - 1$. Wildcards
in the pattern or the text are treated as occurrences of all characters in the alphabet at that location. The
problem is solved over this matrix $A$ by first taking the column-wise "AND" of adjacent columns: this leaves 1
where a text character matches a pattern for a certain offset. We then take column-wise "OR"s of adjacent columns
$\log n$ times: this collapses the alphabet. Taking row-wise "AND"s of adjacent rows $\log q$ times leaves an
indicator vector whose $i$th entry is 1 iff the pattern occurs at location $i$ in the text. Summing the entries
in this vector provides the required answer. Using linearization to bound the degree of $\coprod$ and $\prod$
operators, we again obtain an expression of size $O(\log^2 n)$.



4.2 Protocols Using Linearization 

Given an expression in the form of (2), we now give an inductive description of the protocol. Conceptually,
each round we ask the prover to "strip off" the left-most remaining operation in the expression. In the
process, we reduce a claim by P about the old expression to a claim about the new, shorter expression.
Eventually, V is left with a claim about the value of $f$ at a random point (specifically, at $r$), which V can
check against her independent evaluation of $f(r)$.

More specifically, suppose for some polynomial $g(X_1, \ldots, X_j)$, the prover can convince the verifier that
$g(a_1, a_2, \ldots, a_j) = C$ with probability 1 for any $(a_1, a_2, \ldots, a_j, C)$ where this is true, and probability less
than $\epsilon$ when it is false. Let $U(X_1, X_2, \ldots, X_l)$ be any polynomial on $l$ variables obtained as

$$U(X_1, X_2, \ldots, X_l) = O\, g(X_1, \ldots, X_j),$$

where $O$ is one of $\sum_{x_i=0}^{1}$, $\prod_{x_i=0}^{1}$, $\coprod_{x_i=0}^{1}$, or $L_i$ for some variable $i$. (Thus $l$ is $j - 1$ in the first three cases
and $j$ in the last.) Let $m$ be an upper bound (known to the verifier) on the degree of $U$ with respect to $X_i$. In
our case, $m \le 2$ because of the inclusion of $L_i$ operations in between every $\prod$ and $\coprod$ operation. We show
how P can convince V that $U(a_1, a_2, \ldots, a_l) = C$ with probability 1 for any $(a_1, a_2, \ldots, a_l, C)$ for which
it is true and with probability at most $\epsilon + d/p$ when it is false. By renaming variables if necessary, assume
$i = 1$. The verifier's check is as follows.

Case 1: $O = \sum_{x_1=0}^{1}$. P provides a degree-1 polynomial $s(X_1)$ that is supposed to be $g(X_1, a_2, \ldots, a_j)$.
V checks if $s(0) + s(1) = C$. If not, V rejects. If so, V picks a random value $a \in \mathbb{F}_p$ and asks P to prove
$s(a) = g(a, a_2, \ldots, a_j)$. If it is one of the final $d$ rounds, V chooses $a$ to be the corresponding entry of $r$.

Case 2: $O = \coprod_{x_1=0}^{1}$ or $O = \prod_{x_1=0}^{1}$. We do the same as in Case 1, but replace $s(0) + s(1)$ with
$s(0) + s(1) - s(0)s(1)$ in the case of $\coprod$, or $s(0)s(1)$ in the case of $\prod$.

Case 3: $O = L_1$. P wishes to prove that $U(a_1, a_2, \ldots, a_j) = C$. P provides a degree-2 polynomial $s(X_1)$
that is supposed to be $g(X_1, a_2, \ldots, a_j)$. We refer to this as "unbinding the variable" because previously $X_1$
was "bound" to value $a_1$, but now $X_1$ is free. V checks that $(1 - a_1)s(0) + a_1 s(1) = C$, which is the value at $a_1$
of the linear polynomial agreeing with $s$ at 0 and 1. If not, V rejects. If
so, V picks random $a \in \mathbb{F}_p$ and asks P to prove $s(a) = g(a, a_2, \ldots, a_j)$ (or if it is the final round, V simply
checks that $s(a) = f(r)$).
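The consistency checks for the sum, OR and AND operators are one-line computations over the field. The sketch below covers Cases 1 and 2 only and is an illustration under stated assumptions: the function names are hypothetical, the prover's polynomial is assumed to arrive as a coefficient list (lowest degree first), and the field size is the one used later in the experiments.

```python
P = 2**61 - 1  # assumed field size

def consistent(op, s0, s1, claim, p=P):
    """Check the prover's s(0) and s(1) against the claimed value C."""
    if op == "sum":      # Case 1: s(0) + s(1) = C
        value = s0 + s1
    elif op == "or":     # Case 2: s(0) + s(1) - s(0)s(1) = C
        value = s0 + s1 - s0 * s1
    elif op == "and":    # Case 2: s(0)s(1) = C
        value = s0 * s1
    else:
        raise ValueError("unknown operator: " + op)
    return value % p == claim % p

def eval_poly(coeffs, a, p=P):
    """Evaluate s at a with Horner's rule; coeffs are lowest degree first."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * a + c) % p
    return acc
```

For example, with $s(X) = 2 + 3X$ the verifier would compute `eval_poly([2, 3], 0)` and `eval_poly([2, 3], 1)`, accept a claimed sum of 7, and then carry `eval_poly([2, 3], a)` for a random `a` into the next round.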

Table 1: Circuit checking results with n = 2^'.

The proof of correctness follows by using the observation that if $s(X_1)$ is not the right polynomial, then
with probability at least $1 - m/p$, P must prove an incorrect statement at the next round (this is an instance of the
Schwartz-Zippel polynomial equality testing procedure [30]). The total probability of error is given by a
union bound on the probabilities in each round, $O(\log^2 n/p)$.
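To get a feel for the size of this bound, the union bound can be evaluated exactly with rational arithmetic. The numbers below are assumptions chosen for illustration only: the constant hidden in the round count is taken to be 1, the stream has $\log_2 n = 30$, and the field has size $2^{61} - 1$.

```python
from fractions import Fraction

def soundness_error_bound(rounds, degree, p):
    """Union bound: each round fails with probability at most degree / p."""
    return Fraction(rounds * degree, p)

# Hypothetical sizes: log2(n) = 30, so on the order of 30 * 30 = 900 rounds.
bound = soundness_error_bound(rounds=900, degree=2, p=2**61 - 1)
```

Under these assumptions the bound is below $10^{-15}$, so the number of rounds has almost no effect on soundness when the field is this large.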

Analysis of protocol costs. Recall that both $F_0$ and PMwW can be written as an expression of size
$O(\log^2 n)$ operators, where linearization bounds the degree in any variable. Under the above procedure,
the verifier need only store $r$, $f(r)$, the current values of any "bound" variables, and the most recent value
of $s(a)$. In total, this requires space $O(\log n)$. There are $O(\log^2 n)$ rounds, and in each round a polynomial
of degree at most two is sent from P to V. Such a polynomial can be represented with at most 3 words, so
the total communication is $O(\log^2 n)$. Hence we obtain $(\log^2 n, \log n)$-protocols for $F_0$ and PMwW.

As the stream is being processed the verifier has to update $f(r)$. The updates are very simple, and
processing each update requires $O(d) = O(\log n)$ time. There is a slight overhead in PMwW, where each
update in the stream requires the verifier to propagate $q$ updates to $f$ (assuming an upper bound on $q$ is fixed
in advance), taking $O(q)$ time. However, it seems plausible that these costs could be optimized further.
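The per-update cost can be seen in a short sketch of how a verifier might maintain $f(r)$ incrementally. This assumes $f$ is the multilinear extension of the input vector and that index bits are read least-significant first; the class and function names are hypothetical, and the extension actually used is the one defined in the earlier sections.

```python
P = 2**61 - 1  # assumed field size

def chi(index, r, p=P):
    """Multilinear indicator of `index` evaluated at r, in O(len(r)) time."""
    acc = 1
    for k, r_k in enumerate(r):
        bit = (index >> k) & 1
        acc = acc * (r_k if bit else 1 - r_k) % p
    return acc

class StreamingLDE:
    """Maintains f(r) for the multilinear extension f of a frequency vector."""

    def __init__(self, r, p=P):
        self.r, self.p, self.value = list(r), p, 0

    def update(self, index, delta=1):
        # f is linear in the input, so each update adds delta * chi_index(r).
        self.value = (self.value + delta * chi(index, self.r, self.p)) % self.p
```

The verifier stores only `r` and the running value, which matches the $O(\log n)$ space bound, and each call to `update` performs one multiplication per coordinate of `r`.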

The prover has to store a description of the stream, which can be done in space $O(n)$. The prover can
be implemented to require $O(n \log^2 n)$ time: essentially, each round of the proof requires at most one pass
over the stream data to compute the required functions. For brevity, we omit a detailed description of the
implementation, the source code of which is available at [16].

Theorem 4.1 For any function which can be written as a concatenation of $\log n$ (binary) operators drawn
from $\sum$, $\prod$ and $\coprod$ over inputs of size $n$, there is a $\log^2 n$ round $(\log^2 n, \log n)$ protocol, where P takes time
$O(n \log^2 n)$, and V takes time $O(\log^2 n)$ to run the protocol, having computed the LDE of the input.

Thus we can invoke this theorem for both $F_0$ and PMwW, obtaining $\log^2 n$ round $(\log^2 n, \log n)$ protocols
for both.

5 Experimental Evaluation 

We performed a thorough experimental study to evaluate the potential practical effectiveness of existing 
protocols and our new ones. We summarize our findings as follows. 

• The costs of our implementation of the general-purpose circuit-checking protocol described in Section 3
are extremely attractive, with the exception of P's runtime. The prover takes minutes to operate
on input of size around $10^5$: ideally, this would take seconds. The extensions we propose to the
basic protocol of [19] (such as extra types of gates) result in significant quantitative improvements
for our benchmark problems. We are optimistic about the prospects for further enhancements and
parallelization to make practical general-purpose verification a reality.

• Fine-tuned protocols for specific problems can improve over the general approach by several orders
of magnitude. Specifically, we found that extremely practical non-interactive protocols processing
hundreds of thousands of updates per second are achievable for a very large class of problems, but
only by using the methods described in Section 2. We also found that the linearization technique
results in significantly improved interactive protocols for $F_0$ when compared to the more general
circuit-checking approach.

• Finally, we demonstrate that the non-interactive protocols are extremely amenable to parallelization, 
and we believe that this makes them an attractive option for practical use. 

In all of our experiments, the verifier requires significantly less space than that required to solve the 
problem without a prover, and requires about the same time as that required to solve the problem without a 
prover if given enough fast memory to store the whole input. Indeed, we found that in all of our protocols 
memory accesses are the speed bottleneck in both V's computation and in the computation required to solve 
the problem without a prover. 

Moreover, our circuit-checking results demonstrate that if we were to run our implementation on problems
requiring superlinear time to solve, then V would save significant time as well as space (compared to
solving the problem without a prover). Indeed, except for circuits with very high (i.e., linear) depth, V's
runtime in our circuit-checking implementation is grossly dominated by the time required to perform an
LDE computation via a single streaming pass over the input. The verification time, excluding this cost, is
essentially negligible.

5.1 Implementation Details 

All implementations were done in C++: we simulated the computations of both parties, and measured
the time and resources consumed by the protocols. Our programs were compiled with g++ using the -O3
optimization flag. For the data, we generated synthetic streams in which each item was picked uniformly
at random from the universe, or in which frequencies of each item were chosen uniformly at random in the
range [0, 1000]. The choice of data does not affect the runtimes, which depend only on the amount of data
and not its content. Similarly the security guarantees do not depend on the data, but on the random choices
of the verifier. All computations are over the field of size $p = 2^{61} - 1$, implying a very low probability of
the verifier being fooled by a dishonest prover.

We evaluated the protocols on a multi-core machine with 64-bit AMD Opteron processors and 32 GB of
memory available. Our scalability results primarily use a single core, but we also show results for parallel
operation. The large amount of memory allowed us to experiment with universes of size several billion,
with the prover able to store the full data in memory. We measured the time for V to compute the check
information from the stream, for P to generate the proof, and for V to verify the proof. We also measured
the space required by V, and the size of the proof provided by P.

Figure 1: Experimental results for both multi-round and non-interactive $F_2$ protocols.

Choice of Field Size. While all the protocols we implemented work over arbitrary finite fields, our choice
of $\mathbb{F}_p$ with $p = 2^{61} - 1$ proves ideal for engineering practical protocols. First, the field size is large enough
to provide a minuscule probability of error (which is proportional to $1/p$), but small enough that any field
element can be represented with a single 64-bit data type. By using native types, we achieve a speedup of
several factors. Second, reducing modulo $p$ can be done with a bit shift, a bit-wise AND operation, and an
addition [35]. We experienced a speedup of nearly an order of magnitude by switching to this specialized
"mod" operation rather than using the "% p" operation in C++. Finally, the use of this particular field allows us
to apply FFT techniques, as described in Section 2 (recall $2^{61} - 2$ has many small prime factors).
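The shift-and-add reduction relies on the fact that $2^{61} \equiv 1 \pmod{p}$ when $p = 2^{61} - 1$, so the high bits of a number can be folded onto its low bits. The following is a minimal sketch of that idea, not the authors' C++ routine; the function names are hypothetical, and Python's arbitrary-precision integers stand in for the wide product that a C++ version would have to hold in two 64-bit words.

```python
MERSENNE_EXP = 61
P = (1 << MERSENNE_EXP) - 1

def mod_p(x):
    """Reduce 0 <= x < 2**122 modulo P = 2**61 - 1 without division."""
    # 2**61 is congruent to 1 mod P, so fold the high bits onto the low bits.
    x = (x & P) + (x >> MERSENNE_EXP)   # now x < 2**62
    x = (x & P) + (x >> MERSENNE_EXP)   # now x <= P
    # A final conditional subtraction maps the value P itself to 0.
    return x - P if x >= P else x

def mul_mod(a, b):
    """Multiply two field elements (each below P) and reduce."""
    return mod_p(a * b)
```

Two folds are needed here because the input may be a full product of two field elements; for a sum of two field elements a single fold followed by the conditional subtraction would be enough.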

Correctness of protocols. In the protocols we study, the verifier's checks of the prover's claims are always 
very simple to implement: in many cases, each check takes a single line of code to ensure that the previous 
message is consistent with the new message^. Consequently, it is not difficult to implement the verifier in 
a bug-free manner, and once this is the case, the verifier's implementation serves as an independent check 
on the prover's implementation. This is because the verifier detects (with high probability) any deviations 
from the prescribed protocol, and in particular V detects deviations due to an incorrect prover. Thus, we are 
confident in the correctness of our implementations. More generally, this property can help in the testing 
and debugging of future implementations. 



5.2 Circuit Checking Protocols 

In our implementation of the circuit checking method described in Section 2.1, we put significant effort into
optimizing the runtime of the prover, achieving an implementation for which P takes time nearly linear in
the size of the circuit. Nonetheless, this cost remains the chief limitation of the implementation.

We experimented with our implementation on circuits for three of our functions of interest: $F_2$, $F_0$ and
PMwW. We leave circuits for MVMULT to future work. Results are summarized in Table 1. Throughout,
when we refer to P's runtime in an interactive protocol, we are referring to the total time over all rounds
of the protocol. The speed per gate can be very high: P processed circuits with tens of millions of gates
in a matter of minutes. For example, our basic implementation processed a circuit for $F_0$ with close to 16
million gates in under 9 minutes, or close to 30,000 gates per second. However, since the circuit's size was
more than 100 times larger than the universe over which the input is drawn, this translated to only about 300
items per second. The other costs incurred are very low. The verifier's space usage and the communication
cost are never more than a few dozen kilobytes, and the verifier processes close to thirty million updates per
second across all stream lengths. The time for V to run the protocol is negligible compared to the (already
low) time to compute the required low-degree extension of the input.

^Things are a little more complex in the case of circuit checking, as discussed in Section 3.2, but not dramatically so.

Figure 2: Experiments on non-interactive MVMULT protocols.



In Section 3.2, we discuss how adding additional gate types can reduce the cost of circuit checking. We
demonstrate experimentally that adding gates which compute the 8th power (^8) or the 16th power (^16) of
their inputs achieves substantial reductions in the size of the circuits needed. For $F_0$, this reduced the number
of rounds by nearly a factor of three, the prover time by close to 20%, and the overall communication cost
by close to 30%. We also discuss in Section 3.2 how to (conceptually) replace a binary tree of addition gates
with a single $\oplus$ gate of very large fan-in which sums all its inputs. For $F_0$, this optimization further reduced
both communication and number of rounds by 10-20%. The effect of $\oplus$ gates was much more pronounced
for $F_2$, where we saw an order of magnitude reduction in the number of rounds, and a 5-fold reduction in
communication cost. The change was larger here because the addition gates represent a much larger fraction
of the gates in $F_2$ circuits than in $F_0$ circuits.



5.3 Specialized Protocols 

We now describe our experiments with specialized protocols on a problem-by-problem basis. We find 
that specialized interactive protocols improve over the general-purpose construction by several orders of 
magnitude. Moreover, we demonstrate that the FFT techniques of Section 2 yield non-interactive protocols 
that easily scale to streams with billions of updates, improving over previous implementations by three 
orders of magnitude. The protocols are of various types: the basic multi-round protocols based on sum- 
check from [17] (MRS); multi-round protocols which use linearization from Section 4 (LIN); multi-round 
protocols based on circuit checking described in Section 3 (CC); the basic non-interactive protocols from 
[8] (NI); and the faster implementation of these protocols via FFT in Section 2 (NI-FFT).

Figure 3: Experimental results for $F_0$.



F2: There are four known protocols for F2: one obtained via the general-purpose circuit-checking approach 
(CC), a specialized interactive protocol due to [17] (MRS), a naive implementation of the non-interactive 
protocol due to [8] (NI), and a non-interactive implementation based on our FFT techniques developed in 
Section 2 (NI-FFT). The results for CC are for our optimized implementation using $\oplus$ gates. Figures 1(b)
and 1(c) illustrate the verifier's time and space costs for all four protocols, while Figure 1(a) illustrates the 
prover's runtime for these protocols. We used implementations of NI and MRS protocols for F2 due to 
[17]. Note that in the case of NI and NI-FFT, the verifier behaves identically: the prover computes the same 
messages in both cases, but more quickly using FFT. 

The main observation from Figures 1(b) and 1(c) is that the verifier's costs are extremely low for all 
four protocols. V processed over 20 million items/s across all stream lengths for all protocols. The space 
usage and communication cost for both interactive protocols (CC and MRS) is less than 1 kilobyte across all 
stream lengths tested, while the space usage for the non-interactive case is much larger but still reasonable 
(comfortably under a megabyte even for stream lengths in excess of 1 billion). 

Figure 1(a) shows a clear separation between the four methods in P's effort in generating the proof. For
large streams, it is clear that NI is not scalable, with P's runtime growing like $n^{3/2}$; this implementation
failed to process streams larger than about 40 million updates. In contrast, the FFT-based implementation
of the non-interactive protocol processed between 350,000 and 750,000 items per second for all tested
values of $n$, even for values of $n$ well into the billions. Thus, the FFT techniques of Section 2 speed up
P's computation by several orders of magnitude compared to the naive implementation, and allowed the
protocol to easily scale to streams with billions of items. As mentioned in Section 2, a wide variety of more
complicated protocols use this protocol as a subroutine, and therefore these non-interactive techniques are
as powerful as they are general.

For the multi-round protocols, circuit checking (CC) eventually outpaces NI, and scales linearly: the
CC prover processed about 20,000 items per second across all stream lengths. Finally, the multi-round
prover processed 20-21 million items per second. We conclude that special-purpose protocols should have
substantial value, as our specialized non-interactive protocol was faster than circuit checking by more than
an order of magnitude, and the specialized interactive protocol was faster by two orders of magnitude.

Table 2: Non-interactive MVMult results on matrices of size 10,000 x 10,000 (763 MBs of data).

MVMULT: Figure 2 shows the behavior of our FFT-based implementation of the $(n^{1+\alpha}, n^{1-\alpha})$
non-interactive protocol for MVMULT described in Section 2. Recall that the parameter $\alpha$ allows us to trade off
between communication and space used by the verifier. A convenient (and previously unremarked on) feature
of this protocol is that when $\alpha = 0$, the honest prover's message consists simply of the vector $b$.
Consequently, we obtain an $(n, n)$ protocol for which the prover can handle enormous throughputs: 30-50
million items/second as evidenced in Figure 2(b). In outsourcing settings where one can tolerate space usage
$O(n)$ for the verifier, this protocol is truly ideal, as the prover need do nothing more than solve the problem,
and the verifier's computation consists only of maintaining $n$ fingerprints. That is, this $(n, n)$ protocol allows
the user to obtain a strong security guarantee on the integrity of the query, almost for free. Note that for
this problem, the size of the input is $O(n^2)$ for an $n \times n$ matrix, so $O(n)$ space at the verifier is still much
smaller than the full input size.
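One standard way to check a claimed matrix-vector product with $O(n)$ verifier space is a Freivalds-style fingerprint test, sketched below. This is an illustration of the general idea only and is not claimed to match the protocol of Section 2 step for step: the class name, the choice of row weights $r, r^2, \ldots$, and the stream format (one update per matrix entry) are all assumptions.

```python
import random

P = 2**61 - 1  # assumed field size

class MatVecChecker:
    """Streams the entries of A once, then checks a claimed b = A x (mod P)."""

    def __init__(self, n_rows, n_cols, seed=None):
        rng = random.Random(seed)
        r = rng.randrange(1, P)
        # Row weights r^1, r^2, ..., r^n_rows derived from one random element.
        self.w = [pow(r, i + 1, P) for i in range(n_rows)]
        # col[j] accumulates sum_i w[i] * A[i][j]: one running value per column.
        self.col = [0] * n_cols

    def update(self, i, j, value):
        self.col[j] = (self.col[j] + self.w[i] * value) % P

    def verify(self, x, b):
        lhs = sum(c * x_j for c, x_j in zip(self.col, x)) % P
        rhs = sum(w_i * b_i for w_i, b_i in zip(self.w, b)) % P
        return lhs == rhs
```

In this sketch the prover's only job is to send $b$. If $b \ne Ax$, the two sides differ by a nonzero polynomial in $r$ of degree at most the number of rows, so a wrong answer passes with probability roughly at most $n/p$.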

The behavior becomes more interesting when we set $\alpha > 0$: in this case, in addition to providing the
correct answer, the prover has to do non-trivial computation to prove correctness. Because larger values of $\alpha$
mean less space but more communication (see Figure 2(c)), setting $\alpha > 0$ may be needed when the verifier is
severely space-limited. It may also be necessary when the matrix is very wide: in full generality the protocol
has communication and space cost $(mn^{\alpha}, n^{1-\alpha})$ for an $m \times n$ matrix. We show how different costs vary
as a function of $\alpha$: V's time to process the input (Figure 2(a)), P's time (Figure 2(b)), the communication
cost (Figure 2(d)), and the space used by V (Figure 2(c)). Across all values of $\alpha$, P can process in excess
of 1 million items per second using our FFT techniques. The verifier runs over the stream slightly faster for
higher values of $\alpha$, because V maintains fewer fingerprints for larger $\alpha$. When $\alpha = 0$, V processed about
20 million items per second, and when $\alpha = .25$, V processed in excess of 30 million items per second. For
concreteness, Table 2 displays the costs of the protocol when run on matrices of size 10,000 x 10,000.

$F_0$: We implemented the $(\log n, \sqrt{n} \log n)$ interactive protocol of [17] described at the start of Section 4,
which we refer to as the bounded protocol (B), since it uses a bound on $F_\infty$, the maximum frequency of
any item. We compare this to the new linearization-based protocol (LIN) from Section 4.1, as well as to the
circuit checking approach (CC) of Section 2.1. The circuit-checking results shown are from our optimized
implementation using ^8 gates.

Our focus is primarily on P's runtime, since we find that the bounded protocol is impractical for general
streams: P's runtime is $\Omega(n^2)$. However, recall from Section 4 that P's runtime in the bounded protocol
can be made $O(F_\infty n)$ when there is an a priori upper bound on $F_\infty$, or equivalently when V's memory is at
least $m/F_\infty$ for streams of length $m$. Figure 3(a) shows P's runtime for the bounded protocol as a function
of the universe size $n$, for different bounds on $F_\infty$.

Figure 3(a) shows that for fixed $F_\infty$, the prover's runtime in the bounded protocol grows linearly in $n$ as
expected. When $F_\infty$ is very low, the protocol achieves reasonable throughputs, but as $F_\infty$ grows the runtime
rapidly becomes prohibitive. For example, $F_\infty = 30$ gives about 80,000 items per second, while $F_\infty = 200$
results in just 1,600 items/second. It is clear that this protocol will be unacceptably slow for realistic streams
where $F_\infty$ is in the thousands or larger.

In contrast, P's runtime in the linearization and circuit checking protocols is independent of $F_\infty$. For
linearization, P's runtime grows slightly super-linearly in $n$ (it is $O(n \log^2 n)$ as shown in Section 4), and
as a result the processing speed decreases slowly as the stream length increases (see Figure 3(b)). For short
streams (e.g. n = 2^^), P handles about 17,000 items/second. For n = 2^"^, P handles about 8,000 items
per second. Extrapolating the behavior to streams of length about 1 billion, P should handle about 4,500
items/second. These results are broadly consistent with its theoretical $O(n \log^2 n)$ running-time bound, and
represent a substantial improvement over the bounded protocol and the circuit checking protocol. In the
circuit checking protocol P processes only 200-300 items per second across all stream lengths.

Note, however, that the overhead for the verifier in all three protocols is very light, making the costs
compelling from V's perspective. In all protocols V's space was always well under 1 KB; this cost was
so low for all three protocols that we have omitted the corresponding plot. For the circuit checking and
bounded protocols, V processed about 20 million updates per second, while for the linearization protocol, V
processed 3-5 million items/second. The verifier in the bounded and circuit-checking protocols is faster than
in the linearization protocol because, in the first two, V only requires evaluating a $\log n$-variate polynomial at
a random point, while the linearization protocol requires evaluating a $(\log n + \log m)$-variate polynomial at a
random point. The communication requirement grows larger for circuit checking and the bounded protocol,
with the former approaching 100 KBs for universes of size 10 million, and the latter approaching similar
amounts of communication when $F_\infty = 200$. In contrast, the communication under linearization was an
order of magnitude lower, never more than a few KBs on all streams tested.

In summary, the bounded protocol may be preferable when $F_\infty$ is at most a very small constant (less
than about 30); otherwise, the linearization protocol dominates, with the only downside being decreased
throughput of the verifier.

PMwW: Our experiments on pattern matching showed broadly the same relative trends as for $F_0$ and are
omitted for brevity.

5.4 Parallel Implementations 

The prover's computations in all of the non-interactive protocols studied here are highly parallelizable, as
noted previously. Indeed, using just three OpenMP^ statements, we were able to achieve more than a 7-fold
speedup over the sequential implementation of the FFT protocol, by using all 8 cores of the multi-core
machine our experiments were run on. Consequently, with 8 processors, the ratio between the speed of the
MRS and NI-FFT protocols for $F_2$ drops from 20-60 to 3-8. In theory, the interactive $F_2$ protocol is just as
easy to parallelize as the non-interactive protocol; however, we did not find this to be the case in practice.
The prover's computations in the multi-round protocol are so light-weight (as evidenced by its very high
throughput) that memory access forms the principal bottleneck. In our test machine, all cores share a single
pipe to memory, and the bottleneck remains. In other scenarios, such as each core having a separate pipeline
to memory, multiple cores might yield more substantial speedups.

6 Conclusion and Future Directions 

The ideas and techniques from interactive proof systems have transformed the landscape of computational 
complexity over the last two decades [3, 20]. Yet they have had relatively little practical impact thus far 
in the area of delegated computation. In this paper, we demonstrated that, when combined with significant 
engineering, interactive proof systems have sufficiently evolved to yield protocols suitable for everyday use. 

A particularly encouraging feature of our experimental results is that V's runtime is dominated by the
time required to evaluate the LDE of the input at a point $r$. For the low-complexity (linear or near linear
time) computations we experimented on, this cost is actually comparable to the time required to solve the
problem without a prover, assuming V had enough memory to store the input. But if we were to run our
implementation on problems requiring superlinear time to solve, then V would save significant time as well
as space (compared to solving the problem without a prover).

^http://www.openmp.org

Moreover, if the cost of the LDE computation can be amortized over many queries, then V will save
time as well as space even for very low-complexity functions. This is indeed possible for our non-interactive
protocols, as there is no leakage of information from V to P as long as P does not learn whether V accepts
or rejects after each query; soundness is therefore maintained even if V uses the same $r$ in all instances of
the protocol.

Such amortization for interactive protocols may also be possible in cases where P is not considered
malicious, such as a user simply trying to detect a buggy algorithm. In this setting it is reasonable to use the
same location $r$ in all instances of the protocol even though soundness is not maintained theoretically. Thus,
in these realistic situations, the amortized time cost to the verifier can be considerably sublinear in the input
length, and our protocols will save the verifier both time and space.

The next step is to further advance the boundary of practicality. The chief obstacle for more general 
systems is the requirement of a circuit representation for computations, and the superlinear dependence of 
the prover's time on the size of the circuit. Various approaches offer themselves: either to design protocols 
which circumvent this circuit representation, or to improve the throughput by taking greater advantage of 
the inherent parallelism in the prover's work, e.g. via GPU implementation.

A Details for Theorems 3.2-3.4



In this Appendix, we spell out the details of our efficient instantiation of the construction of [19]. Our 
results ensure that the prover can be implemented efficiently, and that the verifier can be implemented very 
efficiently for a large class of circuits. 

A.1 Notation and Background

We adhere closely to the notation of [19]. We are given an arithmetic circuit $C$ of gates with fan-in 2 over
the field $\mathbb{F}$. $C$ is in layered form and has size $S(n)$ and depth $d(n)$, where $n$ is the number of input wires.
For presentation purposes, assume that all layers of the circuit have at most $n$ gates, and write $v = \log n$.
For each $1 \le i \le d$, we associate the $j$'th gate at layer $i$ of $C$ with the $v$-bit binary representation of $j$, and
for $i \ge 1$, we define two functions, $\mathrm{add}_i, \mathrm{mult}_i : \{0,1\}^{3v} \to \{0,1\}$, which together constitute the wiring
predicate of layer $i$ of $C$. Specifically, these functions take as input three gate labels $(j_1, j_2, j_3)$, and return
1 if gate $j_1$ at layer $i - 1$ is the addition (respectively, multiplication) of gates $j_2$ and $j_3$ at layer $i$, and return
0 otherwise. We let $\widetilde{\mathrm{add}}_i, \widetilde{\mathrm{mult}}_i : \mathbb{F}^{3v} \to \mathbb{F}$ denote the multilinear extensions of $\mathrm{add}_i$ and $\mathrm{mult}_i$ respectively.
That is, $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ are the unique multilinear polynomials over $\mathbb{F}$ that agree with $\mathrm{add}_i$ and $\mathrm{mult}_i$ at all
values in $\{0,1\}^{3v}$.

We also define a function $V_i : \{0,1\}^{v} \to \mathbb{F}$ to represent the values of the gates at layer $i$. That is, $V_i(j)$
equals the value of gate $j$ at layer $i$. Let $\tilde{V}_i : \mathbb{F}^{v} \to \mathbb{F}$ denote the multilinear extension of $V_i$.

Recall from Section 2.1 that at iteration $i$ of the protocol of [19], the sum-check protocol is applied to
a certain $3v$-variate polynomial $f_i$. We are ready to give the definition of $f_i$ as promised. Given a vector
$x \in \mathbb{F}^{3v}$, write $p = (x_1, \ldots, x_v)$, $\omega_1 = (x_{v+1}, \ldots, x_{2v})$ and $\omega_2 = (x_{2v+1}, \ldots, x_{3v})$. Then we define

$$f_i(p, \omega_1, \omega_2) := \beta(p) \left( \widetilde{\mathrm{add}}_i(p, \omega_1, \omega_2) \big( \tilde{V}_i(\omega_1) + \tilde{V}_i(\omega_2) \big) + \widetilde{\mathrm{mult}}_i(p, \omega_1, \omega_2)\, \tilde{V}_i(\omega_1)\, \tilde{V}_i(\omega_2) \right). \qquad (2)$$

Here, $\widetilde{\mathrm{add}}_i$, $\widetilde{\mathrm{mult}}_i$, and $\tilde{V}_i$ are as above, and $\beta$ is a certain polynomial that depends only on $p$.

A.2 Making P Run in Time $O(S(n) \log S(n))$

In this subsection we show how to engineer an efficient prover. First we give an informal outline, then go 
on to make this more precise. 

A.2.1 High-level Outline

In the $j$'th round of the sum-check protocol applied to $f_i$, P is required to send the univariate polynomial

$$g_j(X_j) = \sum_{(x_{j+1}, \ldots, x_{3v}) \in \{0,1\}^{3v-j}} f_i(r_1, \ldots, r_{j-1}, X_j, x_{j+1}, \ldots, x_{3v}).$$

Theorems 3.2, 3.3, and 3.4 rely on the observation that, when $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ are multilinear extensions,
rather than arbitrary low-degree extensions, then each gate at layers $i$ and $i - 1$ contributes to exactly one
term in the sum.

More specifically, the key observation is that the multilinear extension of the wiring predicate acts as a
sum of variable-wise indicator functions on boolean-valued variables, with one indicator function for each
gate at the layer of interest. At any round $j$ of the sum-check protocol, the "unbound" variables (i.e., those
appearing in the sum defining $g_j$) still only range over values in $\{0, 1\}$, and thus each gate $y$ at the current
layer of the circuit still contributes to only one term in the sum in intermediate rounds. Namely, $y$ contributes
to the unique term of the sum that agrees with the trailing bits in the binary representation of $y$, despite the
fact that "bound" variables may take values outside of $\{0, 1\}$.

A.2.2 Decomposing $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ as Sums of Variable-wise Indicator Functions

Since $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ are the multilinear extensions of the wiring predicate, we can write them explicitly as
follows.

For $y \in \{0,1\}^{3v}$ let $\chi_y(x_1, \ldots, x_{3v}) = \prod_{k=1}^{3v} \chi_{y_k}(x_k)$, where $\chi_0(x_k) = 1 - x_k$ and $\chi_1(x_k) = x_k$. $\chi_y$
is the unique multilinear polynomial that takes $y \in \{0,1\}^{3v}$ to 1 and all other values in $\{0,1\}^{3v}$ to 0, i.e., it
is the multilinear extension of the indicator function for boolean vector $y$.

Notice that if $(x_{j+1}, \ldots, x_{3v}) \in \{0,1\}^{3v-j}$, then for any $(r_1, \ldots, r_j) \in \mathbb{F}^{j}$,

$$\chi_y(r_1, \ldots, r_j, x_{j+1}, \ldots, x_{3v}) = \begin{cases} \prod_{l=1}^{j} \chi_{y_l}(r_l), & \text{if } x_k = y_k \text{ for all } k \ge j + 1, \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$

Informally, Equation (3) implies that one may think of $\chi_y$ acting as a variable-wise indicator function
on boolean-valued variables.
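Equation (3) can be checked numerically on a small example. The sketch below is illustrative only: the function names are hypothetical, labels are represented as lists of bits, and the field is assumed to be the one with $p = 2^{61} - 1$ used in the experiments.

```python
P = 2**61 - 1  # assumed field size

def chi_bit(y_k, x_k, p=P):
    """chi_0(x) = 1 - x and chi_1(x) = x, reduced modulo p."""
    return x_k % p if y_k else (1 - x_k) % p

def chi(y, x, p=P):
    """chi_y(x) = prod_k chi_{y_k}(x_k) for a bit list y and a point x."""
    acc = 1
    for y_k, x_k in zip(y, x):
        acc = acc * chi_bit(y_k, x_k, p) % p
    return acc

def chi_prefix(y, r, p=P):
    """Right-hand side of Equation (3) when the boolean suffix agrees with y."""
    return chi(y[:len(r)], r, p)
```

With `y = [1, 0, 1, 1]` and bound values `r = [7, 11]`, the call `chi(y, r + [1, 1])` equals `chi_prefix(y, r)`, while any other boolean suffix such as `[0, 1]` gives zero.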

Since $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ are multilinear extensions, they can be written as a sum of these $\chi_y$ functions,
where each gate $y$ at layer $i - 1$ contributes a term $\chi_y$ to the sum. That is,

$$\widetilde{\mathrm{add}}_i(x_1, \ldots, x_{3v}) = \sum_{\text{add gates } y \text{ at layer } i-1} \chi_y(x_1, \ldots, x_{3v}) \qquad (4)$$

and

$$\widetilde{\mathrm{mult}}_i(x_1, \ldots, x_{3v}) = \sum_{\text{mult gates } y \text{ at layer } i-1} \chi_y(x_1, \ldots, x_{3v}). \qquad (5)$$

It is straightforward to observe the expressions on the right hand sides of Equations (4) and (5) are
multilinear polynomials that agree with $\mathrm{add}_i$ and $\mathrm{mult}_i$ on boolean-valued inputs, and hence the right hand
sides are equal to the multilinear extensions of $\mathrm{add}_i$ and $\mathrm{mult}_i$ respectively.
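Equations (4) and (5) suggest a direct way to evaluate the extended wiring predicate at any point: sum one indicator per gate, at a cost proportional to the number of gates times $v$ rather than $2^{3v}$. The sketch below is a hedged illustration; the tiny layer with $v = 1$ and the names `wiring_mle` and `ADD_GATES` are made up for the example.

```python
P = 2**61 - 1  # assumed field size

def chi(y, x, p=P):
    """Multilinear indicator chi_y(x) for a bit tuple y and a field point x."""
    acc = 1
    for y_k, x_k in zip(y, x):
        acc = acc * (x_k if y_k else 1 - x_k) % p
    return acc

def wiring_mle(gate_labels, x, p=P):
    """Sum of chi_y(x) over the given gate labels, as in Equations (4) and (5)."""
    return sum(chi(y, x, p) for y in gate_labels) % p

# Hypothetical layer with v = 1; each label is a (j1, j2, j3) bit triple:
# gate 0 adds gates 0 and 1 of the next layer, gate 1 adds gate 1 to itself.
ADD_GATES = [(0, 0, 1), (1, 1, 1)]
```

On boolean inputs `wiring_mle(ADD_GATES, b)` returns 1 exactly when `b` is one of the listed labels, and on other field points it returns the value of the multilinear extension.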

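For any vector $x = (x_{j+1}, \dots, x_{3v}) \in \{0,1\}^{3v-j}$, and for any $(r_1, \dots, r_j) \in \mathbb{F}^j$, let $x^*$ denote the
vector

\[
x^* := (r_1, \dots, r_j, x_{j+1}, \dots, x_{3v}) \in \mathbb{F}^{3v},
\]

and let $S_x$ denote the set of gates at layer $i-1$ given by $\{y \in \{0,1\}^{3v} : y_k = x_k \text{ for all } k \ge j+1\}$.
Equations (4) and (5), together with Equation (3), imply that

\[
\widetilde{\mathrm{add}}_i(x^*) = \sum_{\text{add gates } y \in S_x} \left( \prod_{l=1}^{j} \chi_{y_l}(r_l) \right), \tag{6}
\]

and similarly

\[
\widetilde{\mathrm{mult}}_i(x^*) = \sum_{\text{mult gates } y \in S_x} \left( \prod_{l=1}^{j} \chi_{y_l}(r_l) \right). \tag{7}
\]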
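
The step from Equations (4) and (5) to Equations (6) and (7) can be sketched in Python as follows. This is
a toy check under stated assumptions: gates are given as boolean tuples $y$, and the names `wiring_mle`,
`wiring_mle_bound` and `agree_everywhere` as well as the modulus `P` are hypothetical choices for this
illustration.

```python
from itertools import product

P = 2**61 - 1  # illustrative prime field size (an assumption for this sketch)


def chi(y, x):
    """chi_y(x) = prod_k chi_{y_k}(x_k) mod P, with chi_1(t) = t and chi_0(t) = 1 - t."""
    out = 1
    for y_k, x_k in zip(y, x):
        out = out * (x_k if y_k else 1 - x_k) % P
    return out


def wiring_mle(gates, point):
    """Equations (4)/(5): sum chi_y(point) over all gates y of one type."""
    return sum(chi(y, point) for y in gates) % P


def wiring_mle_bound(gates, r, x_suffix):
    """Equations (6)/(7): only gates in S_x contribute, each through its bound prefix."""
    j = len(r)
    s_x = [y for y in gates if tuple(y[j:]) == tuple(x_suffix)]
    return sum(chi(y[:j], r) for y in s_x) % P


def agree_everywhere(gates, r, num_vars):
    """Check that both formulas agree for every boolean suffix."""
    return all(
        wiring_mle(gates, tuple(r) + s) == wiring_mle_bound(gates, r, s)
        for s in product((0, 1), repeat=num_vars - len(r))
    )
```

For a fixed prefix $r$, `wiring_mle_bound` touches each gate once and only reads its first $j$ bits, which is
the property the prover exploits in the next subsection.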
A.2.3 Completing the Calculation

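At round $j$ of this sum-check protocol, the prover must compute the message

\[
g_j(r_j) = \sum_{(x_{j+1}, \dots, x_{3v}) \in \{0,1\}^{3v-j}} f_i(r_1, \dots, r_{j-1}, r_j, x_{j+1}, \dots, x_{3v}).
\]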

Since $g_j$ is a univariate polynomial of degree at most two if we are using multilinear extensions, it suffices
for the prover to send $g_j(r_j)$ for $r_j \in \{0, 1, 2\}$, as these three evaluations uniquely define $g_j$.
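
To illustrate how evaluations at 0, 1 and 2 can stand in for a polynomial, the Python sketch below
recovers the value at an arbitrary field point by Lagrange interpolation. Three points determine a
polynomial only if its degree is at most two, so the sketch assumes that bound; the modulus `P` and the
name `eval_from_three_points` are hypothetical.

```python
P = 2**61 - 1  # illustrative prime field size (an assumption for this sketch)


def eval_from_three_points(g0, g1, g2, r):
    """Evaluate at r the unique polynomial of degree <= 2 with values g0, g1, g2 at 0, 1, 2."""
    inv2 = pow(2, P - 2, P)  # inverse of 2 modulo the prime P (Fermat's little theorem)
    l0 = (r - 1) * (r - 2) % P * inv2 % P  # Lagrange basis polynomial for the node 0
    l1 = -r * (r - 2) % P                  # Lagrange basis polynomial for the node 1
    l2 = r * (r - 1) % P * inv2 % P        # Lagrange basis polynomial for the node 2
    return (g0 * l0 + g1 * l1 + g2 * l2) % P
```

A verifier holding the three values sent by the prover could use such a routine to obtain the value at
its own random challenge without receiving the coefficients explicitly.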

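Using Equations (6) and (7), we can now easily observe that each gate at layer $i-1$ contributes to
exactly one term in the sum. Specifically, for any term $x = (x_{j+1}, \dots, x_{3v}) \in \{0,1\}^{3v-j}$ in the sum, let $x^*$
denote the vector

\[
x^* := (r_1, \dots, r_j, x_{j+1}, \dots, x_{3v}) \in \mathbb{F}^{3v}
\]

as before, and let $p^* \in \mathbb{F}^v$ be the first $v$ entries of this vector, $\omega_1^* \in \mathbb{F}^v$ the middle $v$ entries, and
$\omega_2^* \in \mathbb{F}^v$ the final $v$ entries. Then combining Equations (6) and (7) with (2), we see

\[
f_i(x^*) = \beta(p^*) \cdot \left[ \sum_{\text{add gates } y \in S_x} \left( \prod_{l=1}^{j} \chi_{y_l}(r_l) \right) \left( \tilde{V}_i(\omega_1^*) + \tilde{V}_i(\omega_2^*) \right)
+ \sum_{\text{mult gates } y \in S_x} \left( \prod_{l=1}^{j} \chi_{y_l}(r_l) \right) \cdot \tilde{V}_i(\omega_1^*) \cdot \tilde{V}_i(\omega_2^*) \right]. \tag{8}
\]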

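Each gate $y$ at layer $i-1$ is in $S_x$ for exactly one $x \in \{0,1\}^{3v-j}$. Namely, $x$ is the boolean vector equal
to the last $3v-j$ bits of the binary representation of $y$. Denote this vector by $x(y)$, and similarly let $x^*(y)$,
$p^*(y)$, $\omega_1^*(y)$ and $\omega_2^*(y)$ denote the corresponding vectors implied by $x(y)$.

Equation (8) implies that $y$ contributes only to the term $x(y)$ of the sum defining $g_j(r_j)$ for
$r_j \in \{0, 1, 2\}$. That is, we may write

\[
g_j(r_j) = \sum_{\text{add gates } y \text{ at layer } i-1} \beta(p^*(y)) \left( \prod_{l=1}^{j} \chi_{y_l}(r_l) \right) \left( \tilde{V}_i(\omega_1^*(y)) + \tilde{V}_i(\omega_2^*(y)) \right)
+ \sum_{\text{mult gates } y \text{ at layer } i-1} \beta(p^*(y)) \left( \prod_{l=1}^{j} \chi_{y_l}(r_l) \right) \cdot \tilde{V}_i(\omega_1^*(y)) \cdot \tilde{V}_i(\omega_2^*(y)).
\]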

Thus, the prover can compute $g_j(0)$, $g_j(1)$, and $g_j(2)$ with a single pass over the gates at layer $i-1$.
By a similar calculation, all necessary $\tilde{V}_i(\omega_1)$ and $\tilde{V}_i(\omega_2)$ values for each message of the prover can be
computed with a single pass over the gates at layer $i$. In conclusion, as long as we use the multilinear
extension of the circuit's wiring predicate, the prover can compute each message at layer $i$ with a single pass
over the gates at layer $i-1$ and a single pass over the gates at layer $i$, performing a constant number of
field operations for each gate. Since the sum-check protocol for each layer has $3v = O(\log S(n))$ rounds, the
prover runs in time $O(S(n) \log S(n))$ in total, where $S(n)$ is the size of the circuit.
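
The identity behind the single-pass computation can be checked on a toy layer. The Python sketch below
compares a brute-force sum over all boolean suffixes with a sum that visits each gate once. Several things
here are assumptions made for illustration: the modulus `P`, the representation of gates as
`(kind, y)` pairs with $y = (p, \omega_1, \omega_2)$, the dictionary `W` of gate values at layer $i$, and the form of
$\beta$ as a product of per-bit equality tests with a fixed point `z`. The sketch also recomputes $\tilde{V}_i$
from scratch for each gate, so it demonstrates the identity only, not the stated running time.

```python
from itertools import product

P = 2**61 - 1  # illustrative prime field size (an assumption for this sketch)


def eq(a, b):
    """prod_k (a_k b_k + (1 - a_k)(1 - b_k)) mod P; equals chi_a(b) when a is boolean."""
    out = 1
    for a_k, b_k in zip(a, b):
        out = out * (a_k * b_k + (1 - a_k) * (1 - b_k)) % P
    return out


def mle(values, point):
    """Multilinear extension of `values` (a dict keyed by bit tuples) at `point`."""
    return sum(values[b] * eq(b, point) for b in product((0, 1), repeat=len(point))) % P


def f(gates, W, z, pt, v):
    """The summand f_i at a point pt = (p*, w1*, w2*), following the shape of Equation (8)."""
    p, w1, w2 = pt[:v], pt[v:2 * v], pt[2 * v:]
    add = sum(eq(y, pt) for kind, y in gates if kind == "add") % P
    mult = sum(eq(y, pt) for kind, y in gates if kind == "mult") % P
    v1, v2 = mle(W, w1), mle(W, w2)
    return eq(z, p) * (add * (v1 + v2) + mult * v1 * v2) % P


def g_brute_force(gates, W, z, prefix, v):
    """Sum f over every boolean assignment to the unbound variables."""
    free = 3 * v - len(prefix)
    return sum(f(gates, W, z, tuple(prefix) + x, v) for x in product((0, 1), repeat=free)) % P


def g_single_pass(gates, W, z, prefix, v):
    """Visit each gate once; gate y contributes only to the term x(y)."""
    j, total = len(prefix), 0
    for kind, y in gates:
        pt = tuple(prefix) + tuple(y[j:])   # the vector x*(y)
        weight = eq(y[:j], prefix)          # prod_{l <= j} chi_{y_l}(r_l)
        p, w1, w2 = pt[:v], pt[v:2 * v], pt[2 * v:]
        v1, v2 = mle(W, w1), mle(W, w2)
        body = (v1 + v2) if kind == "add" else v1 * v2
        total += eq(z, p) * weight % P * body
    return total % P
```

In a round $j$, the prefix passed to either function is $(r_1, \dots, r_{j-1}, c)$ for $c \in \{0, 1, 2\}$.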
A.3 Finishing the Proofs of Theorems 3.2 and 3.3



We have demonstrated that if the protocol of [19] is instantiated with the multilinear extensions of the
circuit's wiring predicate and gate value function, then P can be made to run in time $O(S(n) \log S(n))$.
All that remains in proving Theorem 3.2 is to show that for any log-space uniform circuit, the verifier can
evaluate $\widetilde{\mathrm{add}}_i(p, \omega_1, \omega_2)$ and $\widetilde{\mathrm{mult}}_i(p, \omega_1, \omega_2)$ in space $O(\log S(n))$. This holds because V can make an
"implicit" pass over each layer of the circuit and compute the contribution of each gate to $\widetilde{\mathrm{add}}_i$ and
$\widetilde{\mathrm{mult}}_i$. That is, V considers each gate $y$ in turn, and computes $y$'s contribution to $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$
using Equations (4) and (5). This requires $O(S(n))$ time in total, but only $O(\log S(n))$ space, since V never
needs to store an explicit representation of the circuit. Theorem 3.2 follows.
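
The "implicit" pass can be sketched in Python with a generator that produces gates one at a time, so that
only a running total is kept. The layer used here is hypothetical: gate $k$ multiplies input $k$ by itself,
so every wiring triple is $(k, k, k)$. For this layer the streamed sum has a product form built from
per-bit equality tests, which the sketch uses as a cross-check. The names and the modulus `P` are
assumptions for illustration.

```python
P = 2**61 - 1  # illustrative prime field size (an assumption for this sketch)


def chi(y, x):
    """chi_y(x) = prod_k chi_{y_k}(x_k) mod P, with chi_1(t) = t and chi_0(t) = 1 - t."""
    out = 1
    for y_k, x_k in zip(y, x):
        out = out * (x_k if y_k else 1 - x_k) % P
    return out


def squaring_layer_gates(v):
    """Yield the wiring triple (p, w1, w2) of each gate without storing the layer."""
    for k in range(2 ** v):
        bits = tuple((k >> b) & 1 for b in range(v))
        yield bits + bits + bits  # gate k reads input k twice


def evaluate_wiring(gate_stream, point):
    """Accumulate Equation (5) gate by gate; the only state is the running total."""
    total = 0
    for y in gate_stream:
        total = (total + chi(y, point)) % P
    return total


def squaring_layer_closed_form(p, w1, w2):
    """Product of per-bit tests that p, w1 and w2 agree."""
    out = 1
    for a, b, c in zip(p, w1, w2):
        out = out * (a * b * c + (1 - a) * (1 - b) * (1 - c)) % P
    return out
```

The streamed evaluation takes time proportional to the number of gates, matching the $O(S(n))$ bound in the
text, while the closed form takes time proportional to the number of bits.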

Theorem 3.3 follows from the additional observation that $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ do not depend on the input,
nor do the random coins of the verifier, and these coins uniquely determine the points at which V must
evaluate $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$. Thus, V can toss all her coins in the pre-processing phase and compute the
necessary evaluations of $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$. V stores the answers and the random coins for use in the online
phase. In the online phase, V only needs to spend $O(1)$ time per round of the protocol to check P's messages
for consistency, and thus V takes time $O(d(n) \log S(n))$ in the online phase.

In streaming contexts, where V is more space-constrained than time-constrained, this may be acceptable. 
However, the solutions we adopt in our experimental implementation correspond to the stronger Theorem 
3.4, which further reduces the space and time costs for the verifier. 

A.4 Discussion and Formal Statement of Theorem 3.4 

Now that we have defined the polynomial $f_i$ to which the sum-check protocol is applied in the $i$'th iteration
of the construction of [19], we are ready to state Theorem 3.4 formally.

Theorem A.1 (Formal statement of Theorem 3.4.) Let $C$ be a log-space uniform circuit of size $S(n)$ and
depth $d(n)$, and assume there exists an $O(\log S(n))$-space, $\mathrm{poly}(\log S(n))$-time algorithm for evaluating
$\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ at a point, for all layers $i$ of the circuit. Then P requires $O(S(n) \log S(n))$ time to
implement the protocol of Theorem 3.1 over the entire execution. V requires space $O(\log S(n))$ and time
$O(n \log n + d(n)\,\mathrm{poly}(\log S(n)))$, where the $O(n \log n)$ term is due to the time required to evaluate the
low-degree extension of the input at a point.

The remainder of this section is devoted to discussing the applicability of Theorem A.1. We believe
the assumption that the multilinear polynomials $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ can be evaluated quickly by a small-space
algorithm is mild, in both theory and practice. We demonstrate this in three ways. First, we show that all
four motivating problems in this work possess succinct circuits to which Theorem A.1 applies. Second, we
identify a host of other important circuits from the algorithmic literature to which Theorem A.1 also applies.
Third, we apply Theorem A.1 to a complicated circuit appearing in the proof of [19, Corollary 1], to obtain
improved protocols for any language decidable by a (non-deterministic) Turing Machine in small space.

In essence, Theorem A.1 applies to any circuit with a "highly regular" wiring pattern; this explains why
it applies to such a wide array of circuits. The details in the remainder of the section grow lengthy at times,
but the thesis is clear: Theorem A.1 applies to most circuits that arise in both practical applications and
theoretical constructions.

A.4.1 Wiring Predicates for $F_2$, $F_0$, PMwW, and MVMult

We demonstrate that Theorem A.1 applies to all four circuits described in Section 3.
Figure 4: A circuit for $F_2$ on 4 inputs.

1. $F_2$: Recall that the circuit for $F_2$ had a layer of multiplication gates used for computing the square
of each input, and then subsequent levels formed a binary tree of addition gates used to sum up the
results. A visual depiction of this circuit on $n = 4$ inputs is provided in Figure 4.

First, consider layer $d-1$ immediately above the input gates, which consists of multiplication gates
used to square each input; both the in-neighbors of gate $i$ at layer $d-1$ are equal to the $i$'th input gate.
Therefore, if $p = (p_1, \dots, p_v) \in \{0,1\}^v$ denotes the boolean representation of a gate at layer $d-1$, and
$\omega_1 = (\omega_{1,1}, \dots, \omega_{1,v}) \in \{0,1\}^v$ and $\omega_2 = (\omega_{2,1}, \dots, \omega_{2,v}) \in \{0,1\}^v$ denote the boolean representations
of two gates at the input layer, then $\mathrm{mult}_d$ evaluates to true if and only if $p = \omega_1 = \omega_2$, while $\mathrm{add}_d$ is
identically zero. It is easily seen that the multilinear extension of $\mathrm{mult}_d$ is the polynomial

\[
\widetilde{\mathrm{mult}}_d(p, \omega_1, \omega_2) = \prod_{j=1}^{v} \left( p_j \omega_{1,j} \omega_{2,j} + (1 - p_j)(1 - \omega_{1,j})(1 - \omega_{2,j}) \right),
\]

while the multilinear extension of $\mathrm{add}_d$ is the zero polynomial. Clearly, $\widetilde{\mathrm{mult}}_d$ can be evaluated at any
point in $\mathbb{F}^{3v}$ in time and space $O(v) = O(\log n)$.

The rest of the circuit for $F_2$ consists of a binary tree of addition gates, which is used to sum up the
squared item frequencies. Thus, $\widetilde{\mathrm{mult}}_i$ is the zero polynomial for all $i < d$. Meanwhile, for $i < d$ the
predicate $\mathrm{add}_i(p, \omega_1, \omega_2)$ evaluates to 1 if $\omega_1 = 2p$ and $\omega_2 = 2p + 1$, where here we are interpreting
$p$, $\omega_1$, and $\omega_2$ as integers. Thus, it can be seen that

\[
\widetilde{\mathrm{add}}_i(p, \omega_1, \omega_2) = (1 - \omega_{1,1})\,\omega_{2,1} \cdot \prod_{j=2}^{v} \left( p_{j-1} \omega_{1,j} \omega_{2,j} + (1 - p_{j-1})(1 - \omega_{1,j})(1 - \omega_{2,j}) \right).
\]

Conceptually, the leading factor $(1 - \omega_{1,1})\,\omega_{2,1}$ ensures that $\omega_1$ is even (i.e. its first bit is 0) and $\omega_2$ is
odd (i.e. its first bit is 1), while the expression

\[
\prod_{j=2}^{v} \left( p_{j-1} \omega_{1,j} \omega_{2,j} + (1 - p_{j-1})(1 - \omega_{1,j})(1 - \omega_{2,j}) \right)
\]

ensures that the high-order $v-1$ bits of $\omega_1$ and $\omega_2$ agree with the bits of $p$. $\widetilde{\mathrm{add}}_i$ is therefore the unique
multilinear polynomial evaluating to 1 on boolean inputs $(p, \omega_1, \omega_2)$ if $\omega_1 = 2p$ and $\omega_2 = 2p + 1$,
and evaluating to 0 otherwise. Clearly $\widetilde{\mathrm{add}}_i$ can be evaluated at any point in time and space
$O(v) = O(\log n)$. This completes the description of $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ for all layers of the circuit for $F_2$.

Figure 5: The first several layers of a circuit for $F_0$ on three inputs (in place of a fourth input is a "constant"
gate with value one) over the field $\mathbb{F}_p$ with $p = 2^{61} - 1$. The first layer from the bottom computes $a_i^2$ for
all $i$. The second layer from the bottom computes $a_i^4$ and $a_i^2$ for all $i$. The third layer computes $a_i^8$ and
$a_i^6 = a_i^4 \times a_i^2$ for all $i$, while the fourth layer computes $a_i^{16}$ and $a_i^{14} = a_i^8 \times a_i^6$ for all $i$. The remaining layers
(not shown) have structure identical to the third and fourth layers until the value $a_i^{p-1}$ is computed for all $i$,
and the circuit culminates in a binary tree of addition gates.

2. $F_0$: Recall that for each of the $n$ inputs $a_i$, the circuit for $F_0$ from Section 3 computes $a_i^{p-1}$ via
$O(\log p)$ multiplications, and then sums the results via a binary tree of addition gates. We have
already seen the wiring predicate for binary trees, so here we only sketch the wiring predicate for the
$a^{p-1}$ computation, omitting some details for brevity. We do so for the special case of $p = 2^{61} - 1$,
which is the value of $p$ used in our experiments, as this happens to have a particularly "regular" circuit
for computing $a^{p-1}$; the calculation would be similar but less symmetric for other values of $p$.

We may write $p - 1 = 2^{61} - 2$, whose binary representation is 60 1s followed by a 0. Thus,
$a^{p-1} = \prod_{j=1}^{60} a^{2^j}$. The circuit computing $a^{p-1}$ repeatedly squares $a$, and multiplies together the results
"as it goes". In more detail, for $i > 1$ there are two multiplication gates at each layer $d - i$ of the circuit
for computing $a^{p-1}$; the first computes $a^{2^i}$ by squaring the corresponding gate at the $(i-1)$'th layer from
the bottom, and the second computes $\prod_{j=1}^{i-1} a^{2^j}$. See Figure 5 for a visual depiction of the first few layers
of the $F_0$ circuit.

At a high level then, the wiring predicate $\widetilde{\mathrm{mult}}_i(p, \omega_1, \omega_2)$ tests equality of $\omega_1$ and $\omega_2$ with two strings
that depend on the parity of $p$, as even values of $p$ correspond to gates computing $a^{2^i}$ while odd values
correspond to gates computing $\prod_{j=1}^{i-1} a^{2^j}$. Thus, we may write

\[
\widetilde{\mathrm{mult}}_i(p, \omega_1, \omega_2) = (1 - p_1)\,\chi_{\mathrm{even}}(p, \omega_1, \omega_2) + p_1\,\chi_{\mathrm{odd}}(p, \omega_1, \omega_2),
\]

where $\chi_{\mathrm{odd}}$ and $\chi_{\mathrm{even}}$ are multilinear extensions of the appropriate equality predicates, which do not
depend on $p_1$ (we omit a precise definition of $\chi_{\mathrm{odd}}$ and $\chi_{\mathrm{even}}$ for brevity). This can clearly be evaluated
in $O(v)$ time and space.

3. PMwW: The circuit for PMwW is similar to that for $F_0$, so we omit the details for this circuit.
4. MVMult: The circuit described in Section 3 for MVMult computes $(Ax - b)_i$ for all $1 \le i \le n$,
and then applies the circuit for $F_0$ to the result. We have already sketched the wiring predicate for $F_0$,
so we need only describe the wiring predicate of the circuit $C$ computing $(Ax - b)_i$ for all $1 \le i \le n$.
For presentation purposes, we only describe the wiring predicate for a circuit $C'$ which computes
$(Ax)_i$ for all $i$. The wiring predicate for $C'$ is simpler than that of $C$, since $C$ requires some extra
gates to "propagate" the entries of $b$ up to the final layer of the circuit, where they are finally used to
compute $(Ax - b)_i$ for all $1 \le i \le n$. We emphasize that Theorem A.1 applies to the circuit $C$ as
well.

Assume $n$ is a power of 2. To simplify the wiring predicate of $C'$, we will treat $C'$ as having $2n^2$
inputs, where the first $n^2$ inputs of $C'$ are the entries of $A$ in row-major order, and the last $n$ inputs
are the entries of the vector $x$, with all the remaining inputs (between $n^2$ and $2n^2 - n$) set to 0 and
ignored in subsequent layers. We emphasize that this convention does not increase the costs to either
P or V in the protocol applied to $C'$.

Each of the $2n^2$ inputs can be specified with $1 + 2\log n$ bits. Conceptually, the first bit indicates
whether the input specifies an entry of $A$ (a zero indicates yes). The next $\log n$ bits specify the row
of $A$, and are zero for any entry of $x$. The last $\log n$ bits specify the column of $A$ or the entry of $x$.
We therefore write an input to a gate as $h \circ i \circ j$, where $h \in \{0,1\}$, $i, j \in \{0,1\}^{\log n}$, and $\circ$ denotes
concatenation.

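Layer $d-1$ of $C'$ computes $A_{ij} x_j$ for all $1 \le i, j \le n$; there are therefore $n^2$ gates at this layer, so
each gate can be specified with $2\log n$ bits. This layer consists only of multiplication gates, where the
first input to gate $p = i \circ j$ has bit representation $0 \circ i \circ j$, while the second input has bit representation
$1 \circ 0 \circ j$. Thus, for $p \in \{0,1\}^{2\log n}$ and $\omega_1, \omega_2 \in \{0,1\}^{2\log n + 1}$,

\[
\widetilde{\mathrm{add}}_d(p, \omega_1, \omega_2) = 0,
\]

while

\[
\widetilde{\mathrm{mult}}_d(p, \omega_1, \omega_2) = (1 - \omega_{1,1})\,\omega_{2,1}
\cdot \left( \prod_{k=1}^{\log n} \left( p_k \omega_{1,k+1} + (1 - p_k)(1 - \omega_{1,k+1}) \right)(1 - \omega_{2,k+1}) \right)
\cdot \left( \prod_{k=\log n + 1}^{2\log n} \left( p_k \omega_{1,k+1} \omega_{2,k+1} + (1 - p_k)(1 - \omega_{1,k+1})(1 - \omega_{2,k+1}) \right) \right).
\]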

Conceptually, the term $(1 - \omega_{1,1})\,\omega_{2,1}$ ensures that the first bit of $\omega_1$ is 0, and the first bit of $\omega_2$ is 1.
For $p = i \circ j$, the term

\[
\prod_{k=1}^{\log n} \left( p_k \omega_{1,k+1} + (1 - p_k)(1 - \omega_{1,k+1}) \right)(1 - \omega_{2,k+1})
\]

ensures that the next $\log n$ bits of $\omega_1$ equal $i$, while the corresponding bits of $\omega_2$ are all 0. Finally, the
term

\[
\prod_{k=\log n + 1}^{2\log n} \left( p_k \omega_{1,k+1} \omega_{2,k+1} + (1 - p_k)(1 - \omega_{1,k+1})(1 - \omega_{2,k+1}) \right)
\]

ensures that the last $\log n$ bits of both $\omega_1$ and $\omega_2$ equal $j$.

Subsequent layers of $C'$ compute $\sum_{j=1}^{n} A_{ij} x_j$ for each $1 \le i \le n$, which is performed via a binary
tree of addition gates for each $i$. We have already described the predicate for this wiring pattern in the
paragraph on $F_2$.
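
Two of the descriptions above lend themselves to a quick numerical check, sketched below in Python. The
first part compares the closed form for the binary-tree addition layer with a direct sum of $\chi_y$ over
the gates $(g, 2g, 2g+1)$, as in Equation (4). The second part mimics the two-gates-per-layer circuit for
$a^{p-1}$ with $p = 2^{61} - 1$. Function names and the little-endian bit order (index 0 is the "first" bit) are
assumptions for this sketch; summing the resulting values to count nonzero inputs relies on Fermat's
little theorem.

```python
P = 2**61 - 1  # the prime p = 2^61 - 1 from the text, used here as the field size


def chi(y, x):
    """chi_y(x) = prod_k chi_{y_k}(x_k) mod P, with chi_1(t) = t and chi_0(t) = 1 - t."""
    out = 1
    for y_k, x_k in zip(y, x):
        out = out * (x_k if y_k else 1 - x_k) % P
    return out


def to_bits(k, n_bits):
    """Little-endian bits of k: index 0 is the low-order ("first") bit."""
    return tuple((k >> b) & 1 for b in range(n_bits))


def add_tree_closed_form(p, w1, w2):
    """Closed form: w1 even, w2 odd, and the remaining bits of w1, w2 agree with p."""
    out = (1 - w1[0]) * w2[0] % P
    for j in range(1, len(w1)):
        agree = p[j - 1] * w1[j] * w2[j] + (1 - p[j - 1]) * (1 - w1[j]) * (1 - w2[j])
        out = out * agree % P
    return out


def add_tree_by_gates(p, w1, w2):
    """Equation (4): sum chi_y over the gates y = (g, 2g, 2g + 1) of the tree layer."""
    v = len(w1)
    point = tuple(p) + tuple(w1) + tuple(w2)
    total = 0
    for g in range(2 ** (v - 1)):
        y = to_bits(g, v - 1) + to_bits(2 * g, v) + to_bits(2 * g + 1, v)
        total += chi(y, point)
    return total % P


def power_p_minus_1(a):
    """Compute a^(p-1) mod p with one squaring gate and one product gate per layer."""
    sq, prod = a * a % P, 1          # bottom layer: a^2, and an empty product
    for _ in range(2, 61):           # layers i = 2, ..., 60
        # new sq = a^(2^i); new prod = product of a^(2^j) for j = 1, ..., i - 1
        sq, prod = sq * sq % P, prod * sq % P
    return prod * sq % P             # product of a^(2^j) for j = 1, ..., 60


def count_nonzero(inputs):
    """Sum a^(p-1) over the inputs; each nonzero input (mod p) contributes exactly 1."""
    return sum(power_p_minus_1(a % P) for a in inputs) % P
```

Because both forms of the tree predicate are multilinear and agree on boolean inputs, they agree at
every field point, which is what the comparison at non-boolean points exercises.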



A.4.2 Other Circuits 

Theorem A.1 applies to many other circuits that arise in the algorithms literature. Here we provide an
incomplete list, sketching the necessary observations for each.

1. Matrix Multiplication. Theorem A.1 applies to the naive circuit of size $O(n^3)$ and depth $O(\log n)$
for multiplying two $n \times n$ matrices, which is similar to the circuit $C'$ described in Section A.4.1 for
MVMult. More generally, other multiplication algorithms, such as Strassen's algorithm, are also
amenable to encoding as circuits, reducing the size to $O(n^{2.807})$ in this case. We omit the details of
these circuits for brevity.

2. Rational permutations. Rational permutations have arisen in the study of memory hierarchies [1,
10], and capture commonly-used operations such as matrix transposition and bit-reversal. Formally,
a permutation $\Pi$ on $[2^n]$ is rational if it can be expressed as a permutation $\pi$ on bit positions, i.e.
$\Pi((x_1, \dots, x_n)) = (x_{\pi(1)}, \dots, x_{\pi(n)})$ [10]. There is a two-layer circuit $C$ of size $n$ for performing
any rational permutation (i.e. producing output wires that are the permutation of the input wires). Let
the 0'th input gate of $C$ be a "constant gate" hard-coded to value zero. Each gate $p$ at the non-input
layer of $C$ is an addition gate, whose first input is the constant gate, and whose second input is $\Pi(p)$.
Then $\widetilde{\mathrm{mult}}_1$ is the zero polynomial, while

\[
\widetilde{\mathrm{add}}_1(p, \omega_1, \omega_2) = \prod_{j=1}^{\log n} (1 - \omega_{1,j}) \left( p_j \omega_{2,\pi(j)} + (1 - p_j)(1 - \omega_{2,\pi(j)}) \right).
\]

Conceptually, the $(1 - \omega_{1,j})$ term ensures that $\omega_1 = 0$, while the term $(p_j \omega_{2,\pi(j)} + (1 - p_j)(1 - \omega_{2,\pi(j)}))$
ensures that $\omega_2 = \Pi(p)$. Clearly, $\widetilde{\mathrm{add}}_1$ can be evaluated at a point in $\mathrm{polylog}(n)$ time as long as $\pi(i)$
can be evaluated in $\mathrm{polylog}(n)$ time for $i \in \{0,1\}^{\log\log n}$.

If a rational permutation is used as an intermediate step in a computation represented by a circuit $C'$,
then we need not explicitly materialize the above "rational permutation" circuit $C$ as an intermediate
layer $i$ of the larger circuit $C'$. Rather, we can simply modify the wiring predicate of layer $i$ of $C'$
to directly apply the rational permutation to its variables. That is, we replace $\widetilde{\mathrm{add}}_i(p, \omega_1, \omega_2)$ and
$\widetilde{\mathrm{mult}}_i(p, \omega_1, \omega_2)$ with the polynomials $\widetilde{\mathrm{add}}_i(p, \Pi(\omega_1), \Pi(\omega_2))$ and $\widetilde{\mathrm{mult}}_i(p, \Pi(\omega_1), \Pi(\omega_2))$. It is easy
to see that $\widetilde{\mathrm{add}}_i(p, \Pi(\omega_1), \Pi(\omega_2))$ and $\widetilde{\mathrm{mult}}_i(p, \Pi(\omega_1), \Pi(\omega_2))$ are multilinear polynomials as long as
$\Pi$ is a rational permutation, and these polynomials can be evaluated in $\mathrm{polylog}(n)$ time as long as
$\pi(i)$ can be evaluated in $\mathrm{polylog}(n)$ time for $i \in \{0,1\}^{\log\log n}$.

3. Fourier Transform. Theorem A.1 applies to an arithmetic circuit over the complex field $\mathbb{C}$ computing
the standard radix-two decimation-in-time FFT (the most common form of the Cooley-Tukey
algorithm [14]). Let $x \in \mathbb{C}^n$ be the input vector, where $n$ is a power of 2, and let $X \in \mathbb{C}^n$ denote the
output vector. The radix-two decimation-in-time FFT relies on the following recursion: Denoting the
DFT of the even-indexed inputs $x_{2m}$ by $E_k$ and the DFT of the odd-indexed inputs $x_{2m+1}$ by $O_k$, it holds
that

\[
X_k =
\begin{cases}
E_k + e^{-2\pi i k/n}\, O_k & \text{if } k < n/2, \\
E_{k-n/2} + e^{-2\pi i k/n}\, O_{k-n/2} & \text{if } k \ge n/2.
\end{cases}
\]

The algorithm is sufficiently well-known that good introductions are readily available, along with
illustrations of a circuit implementing the above recursion [36]. Essentially, the circuit performs a
bit-reversal on its inputs (which can be implemented as a rational permutation described above), and
then executes $\log n$ "stages", where the $k$'th output of stage $i$ equals

\[
V_i(k_1, \dots, k_{\log n}) = V_{i-1}(k_1, \dots, k_{i-1}, 0, k_{i+1}, \dots, k_{\log n}) + e^{-2\pi i k/n}\, V_{i-1}(k_1, \dots, k_{i-1}, 1, k_{i+1}, \dots, k_{\log n}). \tag{9}
\]

Here $V_{i-1}(k)$ denotes the value of the $k$'th output of the previous stage.

The $i$'th stage can thus be implemented with two layers of gates; the first consists only of multiplication
gates, and serves to multiply the outputs of the previous stage by the appropriate twiddle factors
(the terms of the form $e^{-2\pi i k/n}$). The second layer consists only of addition gates, and combines
outputs as in Equation (9). The wiring predicate of both layers essentially tests whether the $k$'th bit of
gate $p$ is 0 or 1, and performs an appropriate equality test depending on the result. We have seen how
to write equality tests of this form as succinct multilinear polynomials in the paragraph describing the
circuit for $F_0$ in Section A.4.1.
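
The structure described for the Fourier transform (a bit-reversal followed by $\log n$ stages, each
combining two outputs of the previous stage that differ in one bit) can be sketched in Python as below.
This is the textbook iterative radix-two decimation-in-time FFT, not code from the paper. The twiddle
factor used at stage $s$ is $e^{-2\pi i (k \bmod 2^s)/2^s}$, the standard choice, which may differ in form
from how Equation (9) is typeset. All function names are hypothetical.

```python
import cmath


def bit_reverse(k, n_bits):
    """Reverse the n_bits-bit binary representation of k (a rational permutation)."""
    r = 0
    for _ in range(n_bits):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r


def fft_by_stages(x):
    """Bit-reverse the input, then apply log n stages in the shape of Equation (9)."""
    n = len(x)
    n_bits = n.bit_length() - 1
    assert n == 1 << n_bits, "length must be a power of 2"
    v = [x[bit_reverse(k, n_bits)] for k in range(n)]
    for s in range(1, n_bits + 1):
        m, half = 1 << s, 1 << (s - 1)
        # Output k adds the previous output with stage bit 0 to a twiddle factor
        # times the previous output with stage bit 1.
        v = [
            v[k & ~half] + cmath.exp(-2j * cmath.pi * (k % m) / m) * v[k | half]
            for k in range(n)
        ]
    return v


def naive_dft(x):
    """Direct O(n^2) evaluation of the DFT, used only as a reference."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
```

Each stage reads only two entries of the previous stage per output, chosen by setting one bit of the
index to 0 or 1, which is the regularity the wiring predicate exploits.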

A.4.3 More Efficient Protocols for Space-Bounded Computation 

Our final result of this section is to obtain more efficient protocols for any language decided by a non- 
deterministic Turing Machine in small space. In the full version of [19], Goldwasser, Kalai, and Rothblum 
obtain the following result. 

Lemma A.2 ([19], full version) Let $\mathcal{L}$ be any language solvable by a non-deterministic Turing Machine $T$
in space $s(n) = \Omega(\log n)$ and time $t(n)$. Then there is an arithmetic circuit $C$ over an extension field of
$\mathbb{F}_2$ computing $\mathcal{L}$, where $C$ has size $S(n) = \mathrm{poly}(2^{s(n)})$, and depth $d(n) = O(s(n) \log t(n))$. Moreover,
for $1 \le i < d(n)$, there exist polynomial extensions of the functions $\mathrm{add}_i$ and $\mathrm{mult}_i$ that have degree
$\mathrm{poly}(s(n))$ and can be evaluated at a point using space $O(\log S(n))$ and time $\mathrm{poly}(s(n))$.

We show that in fact the circuit C satisfies the following stronger property: 

Corollary A.3 Let $C$, $\mathrm{add}_i$ and $\mathrm{mult}_i$ be as in Lemma A.2. For $1 \le i < d(n)$, the multilinear extensions
$\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ of the functions $\mathrm{add}_i$ and $\mathrm{mult}_i$ can be evaluated at a point using $O(\log S(n))$ words of
memory and time $\mathrm{poly}(s(n))$, while $\widetilde{\mathrm{add}}_{d(n)}$ and $\widetilde{\mathrm{mult}}_{d(n)}$ can be evaluated at a point using $O(\log S(n))$
words of memory and time $O(n \cdot s(n) \log n)$.

Thus, Theorem A.1 implies that in applying the protocol of [19] to $C$, the prover can be made to run
in time $O(S(n) \log S(n))$, rather than $\mathrm{poly}(S(n))$, with a verifier who uses $O(\log S(n))$ space and runs in
time $O(n \cdot s(n) \log n + d(n)\,\mathrm{polylog}(S(n)))$, where $S(n)$ is the size of $C$. Notice in particular that for any
language in $\mathcal{NL}$, the verifier runs in time $O(n \log^2 n)$.

In essence, there are two sources of overhead in the protocol implied by Lemma A.2, where by overhead
we mean the extra computation P must do to solve the problem verifiably, rather than just solving the
problem in an unverifiable manner. First, there is overhead in representing a uniform computation as a
(potentially large) circuit $C$ rather than as a non-deterministic Turing Machine $T$. Second, there is additional
overhead caused by the fact that in Lemma A.2, the prover takes time superlinear in the size of $C$. Our results
in this section remove the latter source of overhead, or at least reduce it to a logarithmic factor rather than
a polynomial factor, while maintaining a super-efficient verifier.

Description of C. In order to present our result, we must first summarize the circuit $C$ as defined in [19],
which can be described as follows. The non-deterministic Turing Machine $T$ is assumed without loss of
generality to have a unique accepting configuration. The circuit $C$ consists of two stages: the first stage
computes the adjacency matrix of the configuration graph of $T$ on input $x$, which requires just a single layer
of gates, while the second stage determines whether there is a path from the starting configuration of $T$ on
input $x$ to the accepting configuration. The second stage determines whether such a path exists by a process
resembling repeated squaring of the adjacency matrix of $T$.

More specifically, closely following the notation in the full version of [19], a configuration of $T$ can be
specified as a tuple $u = (q, i, j, t) \in \{0,1\}^{g(n)}$, where $g(n) = O(1) + \log n + \log s(n) + s(n) = O(s(n))$.
In this tuple, $q$ is a boolean vector describing the machine's state ($O(1)$ bits), $i$ is the boolean representation
of the location of the input-tape head ($\log n$ bits), $j$ is the location of the work-tape head ($\log s(n)$ bits), and
$t$ represents the contents of the work tape ($s(n)$ bits). The configuration graph $G$ of $T$ is a directed acyclic
graph with $2^{g(n)}$ nodes, one for each configuration of $T$, and an edge from $u$ to $v$ if $T$ can move in one step
from configuration $u$ to configuration $v$. We include self-loops in this graph.

As in the full version of [19], let $B_x$ denote the adjacency matrix of the configuration graph of $T$ on input $x$.
The circuit $C$ first computes the entries of $B_x$, and then computes $\log t(n)$ matrices $B_{\log t(n)}, \dots, B_0$, where
the $(u, v)$'th entry of $B_p$ is 1 if there is a path of length at most $2^{\log t(n) - p}$ from $u$ to $v$ in $G$. The matrix
$B_p$ is obtained from $B_{p+1}$ by a process resembling repeated squaring of $B_x$ using naive matrix
multiplication.^1 The wiring structure of this stage of the circuit is similar to that for naive matrix-vector
multiplication, and it is straightforward to observe that the multilinear extensions of $\mathrm{add}_i$ and $\mathrm{mult}_i$ for these
layers can be evaluated in $O(\log S(n))$ time and using $O(\log S(n))$ words of space. We omit these details for
brevity.
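
The squaring process can be sketched in Python using the footnoted update rule, under which an entry
becomes 1 exactly when some intermediate node $w$ links $u$ to $v$. The sketch assumes all matrix entries are
0 or 1, so arithmetic modulo 2 stands in for the extension field of $\mathbb{F}_2$; the function names are
hypothetical.

```python
def square_step(B):
    """One squaring layer: out[u][v] = 1 + prod_w (1 + B[u][w] * B[w][v]) over GF(2)."""
    n = len(B)
    out = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            prod = 1
            for w in range(n):
                # The factor is 0 exactly when w is a midpoint linking u to v.
                prod = prod * (1 + B[u][w] * B[w][v]) % 2
            out[u][v] = (1 + prod) % 2
    return out


def reachable_within(B, log_steps):
    """Apply log_steps squarings; with self-loops this finds paths of length <= 2**log_steps."""
    for _ in range(log_steps):
        B = square_step(B)
    return B
```

Self-loops matter here: they let a shorter path be padded out, so each squaring doubles the maximum path
length covered rather than only the exact length.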

Multilinear Extension of the Remaining Layer. Thus, we need only show that the multilinear extensions
of the wiring predicate of the layer of $C$ computing the entries of $B_x$ can be evaluated using $O(\log S(n))$
words of memory and $O(n \cdot s(n) \log n)$ time. Assume that $C$ has a designated input gate whose value is set
to 0, and another whose value is set to 1; we call these the constant-0 and constant-1 input gates, respectively.
In determining the value of $B_x[u, v]$, the full version of [19] demonstrates that there are 4 cases to consider.
Notice configuration $u$ only reads one input bit, bit $x_i$.

1. Configuration $u$ can always go to $v$, regardless of $x_i$. Then $B_x[u, v] = 1$.

2. Configuration $u$ can never go to $v$, regardless of $x_i$. Then $B_x[u, v] = 0$.

3. Configuration $u$ can go to $v$ only if $x_i = 1$. Then $B_x[u, v] = x_i$.

4. Configuration $u$ can go to $v$ only if $x_i = 0$. Then $B_x[u, v] = 1 + x_i$, with arithmetic done over an
extension field of $\mathbb{F}_2$.

Thus, all gates at layer $d(n) - 1$ of $C$ are addition gates. In Case 1 above, the first input to gate $(u, v)$ is
the constant-0 input gate, while the second is the constant-1 input gate. In Case 2, both inputs to gate $(u, v)$
at layer $d(n) - 1$ equal the constant-0 input gate. In Case 3, the first input to gate $(u, v)$ is the constant-0
input gate, and the second input to gate $(u, v)$ is the $i$'th input gate. In Case 4, the first input to gate $(u, v)$ is
the constant-1 input gate, and the second input to gate $(u, v)$ is the $i$'th input gate.

^1 More specifically, $B_p[u, v] = 1 + \prod_{w \in \{0,1\}^{g(n)}} (1 + B_{p+1}[u, w]\, B_{p+1}[w, v])$, where all arithmetic is done over an
extension field of $\mathbb{F}_2$.
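
The correspondence between the four cases and the inputs of each addition gate can be written out as a
small Python sketch. It assumes 0/1 values, so that addition modulo 2 models addition in the extension
field of $\mathbb{F}_2$; the names `gate_inputs` and `adjacency_entry` are hypothetical.

```python
def gate_inputs(case, x_i):
    """Return the two values fed to the addition gate for (u, v) in the given case."""
    const0, const1 = 0, 1  # values of the constant-0 and constant-1 input gates
    if case == 1:
        return (const0, const1)
    if case == 2:
        return (const0, const0)
    if case == 3:
        return (const0, x_i)
    if case == 4:
        return (const1, x_i)
    raise ValueError("case must be 1, 2, 3 or 4")


def adjacency_entry(case, x_i):
    """Value of the addition gate, i.e. the adjacency-matrix entry, modulo 2."""
    first, second = gate_inputs(case, x_i)
    return (first + second) % 2
```

In each case the gate's output reproduces the value listed for that case, for both possible values of
the input bit.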

Write $u = (q_1, i_1, j_1, t_1)$ and $v = (q_2, i_2, j_2, t_2)$. The fact that the multilinear extension $\widetilde{\mathrm{add}}_{d(n)}(p, \omega_1, \omega_2)$
can be evaluated using $O(\log S(n))$ words of memory and time $O(n \cdot s(n) \log n)$ relies on the fact that
computation is local. More specifically, determining which of the four cases $(u, v)$ is in depends only on the
states $q_1$ and $q_2$ (of which there are only $O(1)$ possibilities), the value of the $j_1$'th bit in both $t_1$ and $t_2$ (of
which there are only 4 possibilities), determining whether the work-tape head can move $j_2 - j_1$ locations
to the right (this requires $j_2 - j_1 \in \{-1, 0, 1\}$, and hence there are only $O(s(n))$ valid possibilities for
$j_2$ and $j_1$), determining whether the input-tape head can move $i_2 - i_1$ locations to the right (this requires
$i_2 - i_1 \in \{-1, 0, 1\}$, and hence there are only $O(n)$ valid possibilities for $i_2$ and $i_1$), and determining
whether all other entries of $t_1$ and $t_2$ are identical (the multilinear extension of this predicate is succinct).

For example, $p$ is in Case 1 iff

1. All bits of $t_1$ and $t_2$ other than bit $j_1$ are equal, and

2. Given state $q_1$ and the value $t_{1,j_1}$ read by the work-tape head, it holds that no matter the value of $x_i$,
the non-deterministic machine can move to state $q_2$, move its work-tape head $j_2 - j_1$ positions to the
right, set bit $j_1$ of its work tape equal to $t_{2,j_1}$, and move its input head $i_2 - i_1$ positions to the
right.

Let $S$ be the set of all values $(q_1, q_2, i_1, i_2, j_1, j_2, t_{1,j_1}, t_{2,j_1}) \in \{0,1\}^{O(\log n + \log s(n))}$ satisfying Property Two
above. Notice all elements of $S$ can be enumerated in time $O(n \cdot s(n))$.

For $p = (q_1, i_1, j_1, t_1, q_2, i_2, j_2, t_2) \in \{0,1\}^{2g(n)}$ and $\omega_1, \omega_2 \in \{0,1\}^{\log n}$, consider the multilinear
polynomial

\[
\chi_{\text{Case 1}}(p, \omega_1, \omega_2) = \left( \sum_{x \in S} \chi_x(p_S) \right) \chi_{\bar{S}}(p_{\bar{S}})\, \chi_0(\omega_1)\, \chi_1(\omega_2),
\]

where $p_S$ denotes the vector $p$ restricted to the entries corresponding to elements in $S$ (and $p_{\bar{S}}$ the remaining
entries), $\chi_x$ is the multilinear polynomial testing for equality with $x$, $\chi_{\bar{S}}$ is the multilinear polynomial for
testing that all bits of $t_1$ and $t_2$ other than bit $j_1$ are equal, $\chi_0$ is the multilinear polynomial for testing that
$\omega_1$ is equal to the index of the constant-0 gate, and $\chi_1$ is the multilinear polynomial for testing that $\omega_2$ is
equal to the index of the constant-1 gate. This polynomial can clearly be evaluated in time
$O(n \cdot s(n) \log n)$ using $O(\log S(n))$ words of memory. On boolean input $(p, \omega_1, \omega_2)$ it evaluates to 1 if $p$ is
in Case 1 and $\omega_1$ and $\omega_2$ are as required by Case 1, and evaluates to zero otherwise.

Similar polynomials $\chi_{\text{Case 2}}$, $\chi_{\text{Case 3}}$, $\chi_{\text{Case 4}}$ can be constructed for Cases 2-4. Thus, we can write

\[
\widetilde{\mathrm{add}}_{d(n)}(p, \omega_1, \omega_2) = \sum_{k=1}^{4} \chi_{\text{Case } k}(p, \omega_1, \omega_2),
\]

which can clearly be evaluated in time $O(n \cdot s(n) \log n)$ and space $O(\log S(n))$.